Przemysław Węglik 17.06.2024
I used the dataset Most Streamed Spotify Songs 2023. The data contains a variety of information about the most-streamed tracks available on Spotify in 2023. The dataset covers both track features and information about the artists and about user interactions with the music. Below is an overview of what the data contains:
The dataset holds basic information about each track, such as its identifier, name, artist name, album name, and release date. The value of particular interest to us is the stream count, which we will try to predict with various methods.
The artist data includes their unique identifiers, names, and the musical genres they represent.
Spotify provides many attributes describing a track's audio characteristics, for example “danceability” (how well the track lends itself to dancing), “energy”, the musical key, loudness, mode (major or minor), and “speechiness” (the presence of spoken words). Other examples are “acousticness”, “instrumentalness” (the presence of instrumental parts), and “liveness” (whether the track was recorded live).
We rename the columns and lightly clean the data
dataframe = read.csv("spotify-2023.csv")
names(dataframe)[names(dataframe) == "danceability_."] <- "danceability"
names(dataframe)[names(dataframe) == "valence_."] <- "valence"
names(dataframe)[names(dataframe) == "energy_."] <- "energy"
names(dataframe)[names(dataframe) == "acousticness_."] <- "acousticness"
names(dataframe)[names(dataframe) == "instrumentalness_."] <- "instrumentalness"
names(dataframe)[names(dataframe) == "liveness_."] <- "liveness"
names(dataframe)[names(dataframe) == "speechiness_."] <- "speechiness"
dataframe = transform(dataframe, streams = as.numeric(streams))
## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion
dataframe = na.omit(dataframe)
head(dataframe)
## track_name artist.s._name artist_count
## 1 Seven (feat. Latto) (Explicit Ver.) Latto, Jung Kook 2
## 2 LALA Myke Towers 1
## 3 vampire Olivia Rodrigo 1
## 4 Cruel Summer Taylor Swift 1
## 5 WHERE SHE GOES Bad Bunny 1
## 6 Sprinter Dave, Central Cee 2
## released_year released_month released_day in_spotify_playlists
## 1 2023 7 14 553
## 2 2023 3 23 1474
## 3 2023 6 30 1397
## 4 2019 8 23 7858
## 5 2023 5 18 3133
## 6 2023 6 1 2186
## in_spotify_charts streams in_apple_playlists in_apple_charts
## 1 147 141381703 43 263
## 2 48 133716286 48 126
## 3 113 140003974 94 207
## 4 100 800840817 116 207
## 5 50 303236322 84 133
## 6 91 183706234 67 213
## in_deezer_playlists in_deezer_charts in_shazam_charts bpm key mode
## 1 45 10 826 125 B Major
## 2 58 14 382 92 C# Major
## 3 91 14 949 138 F Major
## 4 125 12 548 170 A Major
## 5 87 15 425 144 A Minor
## 6 88 17 946 141 C# Major
## danceability valence energy acousticness instrumentalness liveness
## 1 80 89 83 31 0 8
## 2 71 61 74 7 0 10
## 3 51 32 53 17 0 31
## 4 55 58 72 11 0 11
## 5 65 23 80 14 63 11
## 6 92 66 58 19 0 8
## speechiness
## 1 4
## 2 4
## 3 6
## 4 15
## 5 6
## 6 24
fit_ac <- lm(streams ~ acousticness, data = dataframe)
summary(fit_ac)
##
## Call:
## lm(formula = streams ~ acousticness, data = dataframe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -514924519 -371670699 -222444207 157511754 3187110177
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 516784897 26546437 19.467 <2e-16 ***
## acousticness -97769 707306 -0.138 0.89
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 567100000 on 950 degrees of freedom
## Multiple R-squared: 2.011e-05, Adjusted R-squared: -0.001032
## F-statistic: 0.01911 on 1 and 950 DF, p-value: 0.8901
fit_lv <- lm(dataframe$streams ~ dataframe$liveness)
summary(fit_lv)
##
## Call:
## lm(formula = dataframe$streams ~ dataframe$liveness)
##
## Residuals:
## Min 1Q Median 3Q Max
## -528544084 -367994944 -225320538 143020359 3171353537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 550517646 30528184 18.033 <2e-16 ***
## dataframe$liveness -1997346 1339063 -1.492 0.136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 566500000 on 950 degrees of freedom
## Multiple R-squared: 0.002336, Adjusted R-squared: 0.001286
## F-statistic: 2.225 on 1 and 950 DF, p-value: 0.1361
fit_sp <- lm(dataframe$streams ~ dataframe$speechiness)
summary(fit_sp)
##
## Call:
## lm(formula = dataframe$streams ~ dataframe$speechiness)
##
## Residuals:
## Min 1Q Median 3Q Max
## -553557137 -368358821 -208002662 153643116 3169601189
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 579247918 26130426 22.168 < 2e-16 ***
## dataframe$speechiness -6422005 1843079 -3.484 0.000516 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 563600000 on 950 degrees of freedom
## Multiple R-squared: 0.01262, Adjusted R-squared: 0.01158
## F-statistic: 12.14 on 1 and 950 DF, p-value: 0.0005158
The regression line plotted against the data
{
plot(dataframe$speechiness, dataframe$streams)
abline(fit_sp, col = "red")
}
{
fit <- lm(dataframe$liveness ~ dataframe$energy)
plot(dataframe$energy, dataframe$liveness)
abline(fit, col = "red")
}
The stream count is negatively correlated with the amount of spoken word in a song (speechiness). A song's energy is positively correlated with live-performance elements (liveness).
fit_some <- lm(streams ~ key + energy + liveness, data = dataframe)
summary(fit_some)
##
## Call:
## lm(formula = streams ~ key + energy + liveness, data = dataframe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -614216253 -370069060 -215158836 153375317 3094635971
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 614097404 93535709 6.565 8.58e-11 ***
## keyA -116820333 88036466 -1.327 0.185
## keyA# 24779965 95119622 0.261 0.795
## keyB 1222099 85925559 0.014 0.989
## keyC# 85096244 77964824 1.091 0.275
## keyD 8376519 85781441 0.098 0.922
## keyD# 28828680 114624798 0.252 0.801
## keyE 55584746 92624018 0.600 0.549
## keyF -53002660 83681170 -0.633 0.527
## keyF# 4296424 88347586 0.049 0.961
## keyG -70052102 82091254 -0.853 0.394
## keyG# -42406417 83205929 -0.510 0.610
## energy -913936 1127324 -0.811 0.418
## liveness -1868851 1352587 -1.382 0.167
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 567200000 on 938 degrees of freedom
## Multiple R-squared: 0.01246, Adjusted R-squared: -0.001227
## F-statistic: 0.9103 on 13 and 938 DF, p-value: 0.5417
Regression with some variables excluded
fit_filtered <- lm(streams ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe)
summary(fit_filtered)
##
## Call:
## lm(formula = streams ~ . - track_name - artist.s._name - released_day -
## in_deezer_playlists - in_shazam_charts, data = dataframe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.508e+09 -1.502e+08 -3.670e+07 1.113e+08 2.208e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.612e+09 2.027e+09 -3.261 0.001149 **
## artist_count -3.454e+07 1.121e+07 -3.082 0.002116 **
## released_year 3.416e+06 1.006e+06 3.397 0.000711 ***
## released_month 3.842e+06 2.756e+06 1.394 0.163678
## in_spotify_playlists 3.566e+04 1.930e+03 18.475 < 2e-16 ***
## in_spotify_charts 3.546e+06 6.991e+05 5.072 4.75e-07 ***
## in_apple_playlists 2.929e+06 1.807e+05 16.211 < 2e-16 ***
## in_apple_charts -4.344e+05 2.470e+05 -1.759 0.078949 .
## in_deezer_charts -6.296e+06 2.145e+06 -2.935 0.003416 **
## bpm -5.486e+04 3.503e+05 -0.157 0.875573
## keyA 2.155e+06 4.627e+07 0.047 0.962869
## keyA# 6.729e+07 5.015e+07 1.342 0.180018
## keyB 2.290e+07 4.533e+07 0.505 0.613595
## keyC# 5.622e+05 4.122e+07 0.014 0.989122
## keyD -1.328e+07 4.486e+07 -0.296 0.767286
## keyD# 3.669e+07 6.040e+07 0.607 0.543688
## keyE 5.071e+07 4.970e+07 1.020 0.307822
## keyF 2.183e+07 4.402e+07 0.496 0.620138
## keyF# -2.444e+07 4.673e+07 -0.523 0.601035
## keyG -4.475e+07 4.276e+07 -1.047 0.295569
## keyG# 6.018e+06 4.368e+07 0.138 0.890445
## modeMinor 6.419e+06 2.079e+07 0.309 0.757626
## danceability -6.340e+05 7.986e+05 -0.794 0.427473
## valence -3.622e+05 5.071e+05 -0.714 0.475234
## energy -1.237e+06 7.839e+05 -1.578 0.114940
## acousticness 7.332e+05 4.769e+05 1.537 0.124563
## instrumentalness -1.016e+06 1.157e+06 -0.878 0.379913
## liveness 3.977e+05 7.101e+05 0.560 0.575609
## speechiness -1.312e+06 1.017e+06 -1.290 0.197265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 294100000 on 923 degrees of freedom
## Multiple R-squared: 0.7388, Adjusted R-squared: 0.7309
## F-statistic: 93.25 on 28 and 923 DF, p-value: < 2.2e-16
mean(summary(fit_some)$residuals^2)
## [1] 3.169899e+17
mean(summary(fit_filtered)$residuals^2)
## [1] 8.383588e+16
Adding more features reduced the mean squared error several-fold. Note, however, that this is the training-set error, so part of the improvement may simply reflect the extra flexibility of the larger model.
summary(lm(streams ~ energy * liveness, data = dataframe))
##
## Call:
## lm(formula = streams ~ energy * liveness, data = dataframe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -527081102 -371729201 -223431455 144319986 3177832375
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 530659683 125506028 4.228 2.58e-05 ***
## energy 214249 1835562 0.117 0.907
## liveness 1802778 5998489 0.301 0.764
## energy:liveness -52725 83289 -0.633 0.527
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 566900000 on 948 degrees of freedom
## Multiple R-squared: 0.003179, Adjusted R-squared: 2.45e-05
## F-statistic: 1.008 on 3 and 948 DF, p-value: 0.3885
fit_l2 <- lm(streams ~ bpm + I(bpm^2), data = dataframe)
summary(fit_l2)
##
## Call:
## lm(formula = streams ~ bpm + I(bpm^2), data = dataframe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -545793552 -367775793 -224042960 166139165 3157050040
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1068629196 316595991 3.375 0.000767 ***
## bpm -9045609 5057220 -1.789 0.073990 .
## I(bpm^2) 35054 19540 1.794 0.073131 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 566500000 on 949 degrees of freedom
## Multiple R-squared: 0.003386, Adjusted R-squared: 0.001286
## F-statistic: 1.612 on 2 and 949 DF, p-value: 0.2
anova(fit_ac, fit_l2)
## Analysis of Variance Table
##
## Model 1: streams ~ acousticness
## Model 2: streams ~ bpm + I(bpm^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 950 3.0558e+20
## 2 949 3.0455e+20 1 1.0285e+18 3.205 0.07373 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Higher-degree polynomial regression can use the poly() function
fit_l5 <- lm(streams ~ poly(bpm, 5), data = dataframe)
summary(fit_l5)
##
## Call:
## lm(formula = streams ~ poly(bpm, 5), data = dataframe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -647140156 -365012931 -217934211 136175023 3095450018
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.141e+08 1.828e+07 28.120 <2e-16 ***
## poly(bpm, 5)1 -4.262e+07 5.641e+08 -0.076 0.9398
## poly(bpm, 5)2 1.016e+09 5.641e+08 1.802 0.0719 .
## poly(bpm, 5)3 7.897e+08 5.641e+08 1.400 0.1619
## poly(bpm, 5)4 -1.114e+09 5.641e+08 -1.975 0.0486 *
## poly(bpm, 5)5 -1.274e+09 5.641e+08 -2.258 0.0242 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 564100000 on 946 degrees of freedom
## Multiple R-squared: 0.0148, Adjusted R-squared: 0.009589
## F-statistic: 2.841 on 5 and 946 DF, p-value: 0.01484
Logarithmic transformation of the predictor
summary(lm(streams ~ log(bpm), data = dataframe))
##
## Call:
## lm(formula = streams ~ log(bpm), data = dataframe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -515505699 -371231452 -221469931 161070391 3197405857
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 615939052 383296408 1.607 0.108
## log(bpm) -21286853 80055568 -0.266 0.790
##
## Residual standard error: 567100000 on 950 degrees of freedom
## Multiple R-squared: 7.442e-05, Adjusted R-squared: -0.0009781
## F-statistic: 0.0707 on 1 and 950 DF, p-value: 0.7904
Applying more complicated transformations increases the RSE, so the prediction quality deteriorates.
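Since the distribution of streams is heavily right-skewed, transforming the response rather than the predictor might also be worth trying. This is only a sketch, not part of the original analysis, and R-squared is then measured on the log scale:

```r
# Log-transform the (strictly positive) response instead of the predictor.
fit_log_streams <- lm(log(streams) ~ bpm, data = dataframe)
summary(fit_log_streams)$r.squared
```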
dataframe = read.csv("spotify-2023.csv")
names(dataframe)[names(dataframe) == "danceability_."] <- "danceability"
names(dataframe)[names(dataframe) == "valence_."] <- "valence"
names(dataframe)[names(dataframe) == "energy_."] <- "energy"
names(dataframe)[names(dataframe) == "acousticness_."] <- "acousticness"
names(dataframe)[names(dataframe) == "instrumentalness_."] <- "instrumentalness"
names(dataframe)[names(dataframe) == "liveness_."] <- "liveness"
names(dataframe)[names(dataframe) == "speechiness_."] <- "speechiness"
dataframe = transform(dataframe, streams = as.numeric(streams))
## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion
head(dataframe)
## track_name artist.s._name artist_count
## 1 Seven (feat. Latto) (Explicit Ver.) Latto, Jung Kook 2
## 2 LALA Myke Towers 1
## 3 vampire Olivia Rodrigo 1
## 4 Cruel Summer Taylor Swift 1
## 5 WHERE SHE GOES Bad Bunny 1
## 6 Sprinter Dave, Central Cee 2
## released_year released_month released_day in_spotify_playlists
## 1 2023 7 14 553
## 2 2023 3 23 1474
## 3 2023 6 30 1397
## 4 2019 8 23 7858
## 5 2023 5 18 3133
## 6 2023 6 1 2186
## in_spotify_charts streams in_apple_playlists in_apple_charts
## 1 147 141381703 43 263
## 2 48 133716286 48 126
## 3 113 140003974 94 207
## 4 100 800840817 116 207
## 5 50 303236322 84 133
## 6 91 183706234 67 213
## in_deezer_playlists in_deezer_charts in_shazam_charts bpm key mode
## 1 45 10 826 125 B Major
## 2 58 14 382 92 C# Major
## 3 91 14 949 138 F Major
## 4 125 12 548 170 A Major
## 5 87 15 425 144 A Minor
## 6 88 17 946 141 C# Major
## danceability valence energy acousticness instrumentalness liveness
## 1 80 89 83 31 0 8
## 2 71 61 74 7 0 10
## 3 51 32 53 17 0 31
## 4 55 58 72 11 0 11
## 5 65 23 80 14 63 11
## 6 92 66 58 19 0 8
## speechiness
## 1 4
## 2 4
## 3 6
## 4 15
## 5 6
## 6 24
library(class)
library(MASS)
library(nnet)
model <- multinom(key ~ mode + danceability + valence + energy + acousticness, data = dataframe)
## # weights: 84 (66 variable)
## initial value 2368.116037
## iter 10 value 2325.042927
## iter 20 value 2322.501362
## iter 30 value 2289.789215
## iter 40 value 2273.311980
## iter 50 value 2250.119530
## iter 60 value 2245.724806
## final value 2245.641183
## converged
summary(model)
## Call:
## multinom(formula = key ~ mode + danceability + valence + energy +
## acousticness, data = dataframe)
##
## Coefficients:
## (Intercept) modeMinor danceability valence energy
## A 1.29606649 1.1214705 -0.0004739765 -0.008824665 -1.833282e-02
## A# -0.82550988 1.3649660 0.0173909802 -0.005349928 -1.317957e-02
## B -2.50188219 1.5174331 0.0211078471 -0.010757154 1.474683e-02
## C# -0.56168512 0.8019744 0.0211956324 -0.014643921 3.208753e-03
## D -0.84797771 -0.2253672 0.0233053132 -0.011721178 -1.789800e-03
## D# -1.14176624 1.9424135 0.0015235406 -0.016029464 1.354838e-04
## E -1.71322646 2.3858951 0.0133787065 -0.032879557 1.006282e-02
## F -0.83124926 1.3189090 0.0093076635 -0.005296897 6.422626e-05
## F# -2.52137966 1.6311152 0.0062304434 0.006098420 1.023718e-02
## G -0.01306834 0.4715483 0.0121626795 -0.001725732 -9.276453e-03
## G# 0.72141622 0.4613253 0.0053704694 -0.001717764 -1.210428e-02
## acousticness
## A -0.0093504718
## A# -0.0074776880
## B -0.0016640732
## C# -0.0119998469
## D -0.0030400454
## D# -0.0008041040
## E 0.0074040615
## F -0.0009512604
## F# 0.0065047107
## G -0.0075410260
## G# -0.0134505680
##
## Std. Errors:
## (Intercept) modeMinor danceability valence energy acousticness
## A 1.127693 0.3484438 0.01187603 0.008001999 0.01252731 0.007560088
## A# 1.274097 0.3705158 0.01327752 0.008629937 0.01363357 0.008316290
## B 1.199332 0.3412472 0.01216630 0.007866253 0.01260662 0.007707144
## C# 1.054866 0.3181438 0.01081752 0.007128288 0.01126190 0.007036789
## D 1.145787 0.3849299 0.01181744 0.007791822 0.01226565 0.007433363
## D# 1.500188 0.4481345 0.01561686 0.010553201 0.01652615 0.009885615
## E 1.257575 0.3888066 0.01288685 0.008869717 0.01374536 0.008085948
## F 1.119000 0.3330552 0.01158835 0.007625619 0.01206428 0.007291477
## F# 1.221157 0.3506954 0.01253501 0.008157870 0.01290945 0.007750776
## G 1.085807 0.3381602 0.01126353 0.007411170 0.01171786 0.007152007
## G# 1.092456 0.3430049 0.01135511 0.007525217 0.01187366 0.007348994
##
## Residual Deviance: 4491.282
## AIC: 4623.282
Mode strongly correlates with some of the keys, and other individual features correlate somewhat as well; normalizing the data would help here to compare the coefficients on an equal footing.
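To compare coefficient magnitudes more fairly, one can standardize the numeric predictors before fitting. A sketch (assuming the cleaned column names from the steps above; `multinom` comes from `nnet`):

```r
library(nnet)

# Standardize the numeric predictors: mean 0, sd 1 per column.
num_cols <- c("danceability", "valence", "energy", "acousticness")
scaled_df <- dataframe
scaled_df[num_cols] <- scale(scaled_df[num_cols])

model_scaled <- multinom(key ~ mode + danceability + valence + energy + acousticness,
                         data = scaled_df, trace = FALSE)
# Coefficients are now expressed per standard deviation of each feature,
# so their magnitudes can be compared directly.
summary(model_scaled)$coefficients
```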
training_pred <- predict(model, dataframe, type="class")
accuracy <- mean(training_pred == dataframe$key)
accuracy
## [1] 0.1511018
The accuracy is not impressive, but the key may simply be very hard to predict: with 11 classes to choose from, the model did nevertheless learn something.
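For context, a sketch (assuming the training accuracy computed above is still in `accuracy`): comparing against the naive majority-class baseline shows how much the model actually learned.

```r
# Majority-class baseline: always predict the most frequent key.
baseline <- max(table(dataframe$key)) / nrow(dataframe)
baseline
accuracy - baseline  # a positive value means the model beats the naive guess
```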
library(caTools)
trainIndex <- sample.split(dataframe$key, SplitRatio = 0.8)
train <- subset(dataframe, trainIndex)
test <- subset(dataframe, !trainIndex)
We fit the regression on the training set
model2 <- multinom(key ~ mode + danceability + valence + energy + acousticness, data = train)
## # weights: 84 (66 variable)
## initial value 1895.983774
## iter 10 value 1863.977396
## iter 20 value 1859.999453
## iter 30 value 1848.445103
## iter 40 value 1831.596045
## iter 50 value 1798.027160
## iter 60 value 1794.070430
## final value 1793.866926
## converged
summary(model2)
## Call:
## multinom(formula = key ~ mode + danceability + valence + energy +
## acousticness, data = train)
##
## Coefficients:
## (Intercept) modeMinor danceability valence energy acousticness
## A 0.06575718 1.06495427 0.003170283 -0.008995063 -0.004549996 -0.003171147
## A# -1.10776674 1.41517000 0.018149726 -0.007248873 -0.008969794 -0.005833817
## B -1.98803425 1.40698964 0.021133889 -0.013757719 0.011553600 -0.004464912
## C# -1.37807325 0.84316696 0.026707953 -0.016777222 0.011310846 -0.012467825
## D -1.20770371 -0.08102582 0.031691509 -0.013433198 -0.004592226 -0.001786508
## D# -0.81974759 2.01047114 -0.001771200 -0.012882544 -0.003837304 -0.003311427
## E -2.00392420 2.35965625 0.014690998 -0.035410001 0.014592425 0.009491489
## F -1.53735721 1.25579711 0.011743409 -0.004245761 0.006465475 0.002730664
## F# -2.98189647 1.71646417 0.005506615 0.004932785 0.016547916 0.010229096
## G -0.02066264 0.28791529 0.020489468 -0.006370421 -0.011185816 -0.012902503
## G# 0.31858356 0.41110765 0.007904632 -0.001923161 -0.008984386 -0.011010900
##
## Std. Errors:
## (Intercept) modeMinor danceability valence energy acousticness
## A 1.295237 0.3898043 0.01347035 0.008975690 0.01399429 0.008455955
## A# 1.442345 0.4148728 0.01496439 0.009698123 0.01522954 0.009266060
## B 1.341437 0.3816138 0.01361867 0.008856463 0.01401051 0.008622799
## C# 1.218650 0.3564425 0.01236211 0.008069199 0.01271568 0.008066969
## D 1.310895 0.4186807 0.01354546 0.008768568 0.01374315 0.008305389
## D# 1.684605 0.5084480 0.01761437 0.011911359 0.01844870 0.011056641
## E 1.422020 0.4336138 0.01447390 0.009980614 0.01535764 0.008931496
## F 1.279674 0.3735852 0.01315609 0.008599255 0.01350568 0.008147586
## F# 1.390711 0.3954521 0.01417971 0.009239454 0.01446861 0.008655792
## G 1.241141 0.3836586 0.01284034 0.008371558 0.01312379 0.008169692
## G# 1.241791 0.3850781 0.01286565 0.008469311 0.01324551 0.008218378
##
## Residual Deviance: 3587.734
## AIC: 3719.734
and we use the resulting model to make predictions for the test set
testing_pred <- predict(model2, test, type="class")
accuracy <- mean(testing_pred == test$key)
accuracy
## [1] 0.1421053
The test-set accuracy is only slightly lower than the training accuracy, so the model generalizes roughly as well as it fits; perhaps we happened upon a relatively “easy” test set.
train <- subset(dataframe, trainIndex)
test <- subset(dataframe, !trainIndex)
train = na.omit(train)[c("key", "danceability", "valence", "energy", "acousticness")]
test = na.omit(test)[c("key", "danceability", "valence", "energy", "acousticness")]
train_labels <- train$key
test_labels <- test$key
# Remove 'key' column from train and test
train$key <- NULL
test$key <- NULL
knn_result <- knn(train=train, test=test, cl=train_labels, k=3)
confusionMatrix <- table(pred=knn_result, true=test_labels)
accuracy <- sum(diag(confusionMatrix)) / sum(confusionMatrix)
print(accuracy)
## [1] 0.08947368
Worse results than for the multinomial classification above; kNN performs poorly on this dataset.
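Since kNN is distance-based, features measured on different scales can dominate the distance. A sketch (reusing `train`, `test`, `train_labels`, and `test_labels` from above) of rescaling before classification:

```r
# Standardize with the training set's center/scale, then apply
# the same transformation to the test set.
train_sc <- scale(train)
test_sc <- scale(test,
                 center = attr(train_sc, "scaled:center"),
                 scale = attr(train_sc, "scaled:scale"))
knn_sc <- knn(train = train_sc, test = test_sc, cl = train_labels, k = 3)
mean(knn_sc == test_labels)  # accuracy with standardized features
```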
dataframe = read.csv("spotify-2023.csv")
names(dataframe)[names(dataframe) == "danceability_."] <- "danceability"
names(dataframe)[names(dataframe) == "valence_."] <- "valence"
names(dataframe)[names(dataframe) == "energy_."] <- "energy"
names(dataframe)[names(dataframe) == "acousticness_."] <- "acousticness"
names(dataframe)[names(dataframe) == "instrumentalness_."] <- "instrumentalness"
names(dataframe)[names(dataframe) == "liveness_."] <- "liveness"
names(dataframe)[names(dataframe) == "speechiness_."] <- "speechiness"
dataframe = transform(dataframe, streams = as.numeric(streams))
## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion
library(class)
library(MASS)
head(dataframe)
## track_name artist.s._name artist_count
## 1 Seven (feat. Latto) (Explicit Ver.) Latto, Jung Kook 2
## 2 LALA Myke Towers 1
## 3 vampire Olivia Rodrigo 1
## 4 Cruel Summer Taylor Swift 1
## 5 WHERE SHE GOES Bad Bunny 1
## 6 Sprinter Dave, Central Cee 2
## released_year released_month released_day in_spotify_playlists
## 1 2023 7 14 553
## 2 2023 3 23 1474
## 3 2023 6 30 1397
## 4 2019 8 23 7858
## 5 2023 5 18 3133
## 6 2023 6 1 2186
## in_spotify_charts streams in_apple_playlists in_apple_charts
## 1 147 141381703 43 263
## 2 48 133716286 48 126
## 3 113 140003974 94 207
## 4 100 800840817 116 207
## 5 50 303236322 84 133
## 6 91 183706234 67 213
## in_deezer_playlists in_deezer_charts in_shazam_charts bpm key mode
## 1 45 10 826 125 B Major
## 2 58 14 382 92 C# Major
## 3 91 14 949 138 F Major
## 4 125 12 548 170 A Major
## 5 87 15 425 144 A Minor
## 6 88 17 946 141 C# Major
## danceability valence energy acousticness instrumentalness liveness
## 1 80 89 83 31 0 8
## 2 71 61 74 7 0 10
## 3 51 32 53 17 0 31
## 4 55 58 72 11 0 11
## 5 65 23 80 14 63 11
## 6 92 66 58 19 0 8
## speechiness
## 1 4
## 2 4
## 3 6
## 4 15
## 5 6
## 6 24
We remove NAs
dataframe <- na.omit(dataframe)
We create a training set from half of the available observations; the rest
will form the validation set. To ensure reproducibility of the computations
we use the set.seed function.
set.seed(1)
n <- nrow(dataframe)
train <- sample(n, n / 2)
We fit a linear model on the training set, then compute the MSE on the validation set.
dataframe_lm <- lm(streams ~ liveness, data = dataframe, subset = train)
validation_set <- dataframe[-train,]
mse <- mean((validation_set$streams - predict(dataframe_lm, validation_set))^2)
mse
## [1] 2.818154e+17
We repeat the same for polynomial regression of higher degrees
for (i in 2:5) {
dataframe_lm_poly <- lm(streams ~ poly(liveness, degree = i), data = dataframe,
subset = train)
print(mean((validation_set$streams - predict(dataframe_lm_poly, validation_set))^2))
}
## [1] 2.818228e+17
## [1] 2.822097e+17
## [1] 2.825869e+17
## [1] 2.829181e+17
It is hard to infer the stream count from whether a song is performed live. And if we must, a low polynomial degree is the better choice.
We repeat the computation for a different validation set.
set.seed(2)
train <- sample(n, n / 2)
validation_set <- dataframe[-train,]
degree_max <- 5
mse <- rep(0, times = degree_max)
for (i in 1:degree_max) {
dataframe_lm <- lm(streams ~ poly(liveness, degree = i), data = dataframe, subset = train)
mse[i] <- mean((validation_set$streams - predict(dataframe_lm, validation_set))^2)
}
mse
## [1] 2.631586e+17 2.631819e+17 2.776190e+17 2.682317e+17 2.801860e+17
Slightly different results, as expected.
The results can be visualized on a plot
plot(mse, xlab = "Stopień wielomianu", ylab = "MSE", type = "b", pch = 20,
col = "blue")
Cross-validation for generalized linear models is performed by the
cv.glm() function from the boot package. Its argument
(glmfit) is an object of class glm, so if we
want to use it to validate ordinary linear models, we must fit them
as generalized linear models (with
family = gaussian, which is in fact the default value).
The cv.glm() function returns a list (see
?cv.glm) whose most interesting component is
delta: a vector of length 2 containing the prediction-error
estimate in its original version and a version corrected for the bias
introduced by cross-validation other than LOOCV.
library(boot)
compute_loocv_mse <- function(degree) {
dataframe_glm <- glm(streams ~ poly(liveness, degree), data = dataframe)
cv.glm(dataframe, dataframe_glm)$delta[1]
}
mse <- sapply(1:degree_max, compute_loocv_mse)
mse
## [1] 3.213943e+17 3.217088e+17 3.217095e+17 3.218216e+17 3.222182e+17
We can also draw a plot
plot(mse, xlab = "Stopień wielomianu", ylab = "LOOCV MSE", type = "b", pch = 20,
col = "blue")
[What do we now conclude about polynomial regression in our case?]
The MSE is even worse.
We use the cv.glm() function similarly, only now we explicitly
set the parameter K, the number of groups
(folds). E.g. for \(k = 10\)
it looks like the following.
compute_kcv_mse <- function(degree, k) {
dataframe_glm <- glm(streams ~ poly(liveness, degree), data = dataframe)
cv.glm(dataframe, dataframe_glm, K = k)$delta[1]
}
mse <- sapply(1:degree_max, compute_kcv_mse, k = 10)
mse
## [1] 3.212015e+17 3.219672e+17 3.225988e+17 3.225096e+17 3.215037e+17
Of course, this time the results are random. We can tabulate them for, say, 10 runs.
mse10 <- replicate(10, sapply(1:degree_max, compute_kcv_mse, k = 10))
mse10
## [,1] [,2] [,3] [,4] [,5]
## [1,] 3.218311e+17 3.211233e+17 3.209605e+17 3.217964e+17 3.216730e+17
## [2,] 3.216456e+17 3.218086e+17 3.224292e+17 3.216497e+17 3.215884e+17
## [3,] 3.225961e+17 3.220367e+17 3.212582e+17 3.225169e+17 3.239123e+17
## [4,] 3.219497e+17 3.216228e+17 3.213630e+17 3.211560e+17 3.214825e+17
## [5,] 3.226594e+17 3.226924e+17 3.218692e+17 3.219549e+17 3.228773e+17
## [,6] [,7] [,8] [,9] [,10]
## [1,] 3.208841e+17 3.214266e+17 3.213332e+17 3.215641e+17 3.211343e+17
## [2,] 3.219493e+17 3.214436e+17 3.213963e+17 3.225785e+17 3.224823e+17
## [3,] 3.223226e+17 3.217683e+17 3.225869e+17 3.223603e+17 3.212580e+17
## [4,] 3.223637e+17 3.227963e+17 3.216905e+17 3.221527e+17 3.222117e+17
## [5,] 3.224161e+17 3.222716e+17 3.229107e+17 3.235770e+17 3.231393e+17
We were not unlucky; fitting Gaussian models simply does not work here. Predicting the stream count is probably just very hard.
We will use the bootstrap method to estimate the standard errors
of the linear-regression coefficients. The basic function here is
boot() from the boot package. It requires as a
parameter a function that computes the statistic of interest for a given
data set. The latter function should accept two parameters:
the data set and a vector of indices (other variants also exist:
?boot).
lm_coefs <- function(data, index = 1:nrow(data)) {
  coef(lm(streams ~ liveness, data = data, subset = index))
}
The lm_coefs() function computes regression-coefficient
estimates for a bootstrap sample created from
dataframe:
n <- nrow(dataframe)
lm_coefs(dataframe, sample(n, n, replace = TRUE))
## (Intercept) liveness
## 482955426.4 -842727.7
Of course, one such sample is the original data itself
lm_coefs(dataframe)
## (Intercept) liveness
## 550517646 -1997345
Computing the standard errors with the bootstrap using 1000 replications looks as follows.
boot(dataframe, lm_coefs, R = 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = dataframe, statistic = lm_coefs, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 550517646 -419644.789 29545083
## t2* -1997345 -6663.118 1147976
dataframe = read.csv("spotify-2023.csv")
names(dataframe)[names(dataframe) == "danceability_."] <- "danceability"
names(dataframe)[names(dataframe) == "valence_."] <- "valence"
names(dataframe)[names(dataframe) == "energy_."] <- "energy"
names(dataframe)[names(dataframe) == "acousticness_."] <- "acousticness"
names(dataframe)[names(dataframe) == "instrumentalness_."] <- "instrumentalness"
names(dataframe)[names(dataframe) == "liveness_."] <- "liveness"
names(dataframe)[names(dataframe) == "speechiness_."] <- "speechiness"
dataframe = transform(dataframe, streams = as.numeric(streams))
## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion
head(dataframe)
## track_name artist.s._name artist_count
## 1 Seven (feat. Latto) (Explicit Ver.) Latto, Jung Kook 2
## 2 LALA Myke Towers 1
## 3 vampire Olivia Rodrigo 1
## 4 Cruel Summer Taylor Swift 1
## 5 WHERE SHE GOES Bad Bunny 1
## 6 Sprinter Dave, Central Cee 2
## released_year released_month released_day in_spotify_playlists
## 1 2023 7 14 553
## 2 2023 3 23 1474
## 3 2023 6 30 1397
## 4 2019 8 23 7858
## 5 2023 5 18 3133
## 6 2023 6 1 2186
## in_spotify_charts streams in_apple_playlists in_apple_charts
## 1 147 141381703 43 263
## 2 48 133716286 48 126
## 3 113 140003974 94 207
## 4 100 800840817 116 207
## 5 50 303236322 84 133
## 6 91 183706234 67 213
## in_deezer_playlists in_deezer_charts in_shazam_charts bpm key mode
## 1 45 10 826 125 B Major
## 2 58 14 382 92 C# Major
## 3 91 14 949 138 F Major
## 4 125 12 548 170 A Major
## 5 87 15 425 144 A Minor
## 6 88 17 946 141 C# Major
## danceability valence energy acousticness instrumentalness liveness
## 1 80 89 83 31 0 8
## 2 71 61 74 7 0 10
## 3 51 32 53 17 0 31
## 4 55 58 72 11 0 11
## 5 65 23 80 14 63 11
## 6 92 66 58 19 0 8
## speechiness
## 1 4
## 2 4
## 3 6
## 4 15
## 5 6
## 6 24
dataframe <- na.omit(dataframe)
Feature-selection methods are implemented in the
regsubsets() function from the leaps package.
library(leaps)
dataframe_bs <- regsubsets(streams ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe, really.big=T)
summary(dataframe_bs)
## Subset selection object
## Call: regsubsets.formula(streams ~ . - track_name - artist.s._name -
## released_day - in_deezer_playlists - in_shazam_charts, data = dataframe,
## really.big = T)
## 28 Variables (and intercept)
## Forced in Forced out
## artist_count FALSE FALSE
## released_year FALSE FALSE
## released_month FALSE FALSE
## in_spotify_playlists FALSE FALSE
## in_spotify_charts FALSE FALSE
## in_apple_playlists FALSE FALSE
## in_apple_charts FALSE FALSE
## in_deezer_charts FALSE FALSE
## bpm FALSE FALSE
## keyA FALSE FALSE
## keyA# FALSE FALSE
## keyB FALSE FALSE
## keyC# FALSE FALSE
## keyD FALSE FALSE
## keyD# FALSE FALSE
## keyE FALSE FALSE
## keyF FALSE FALSE
## keyF# FALSE FALSE
## keyG FALSE FALSE
## keyG# FALSE FALSE
## modeMinor FALSE FALSE
## danceability FALSE FALSE
## valence FALSE FALSE
## energy FALSE FALSE
## acousticness FALSE FALSE
## instrumentalness FALSE FALSE
## liveness FALSE FALSE
## speechiness FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## artist_count released_year released_month in_spotify_playlists
## 1 ( 1 ) " " " " " " "*"
## 2 ( 1 ) " " " " " " "*"
## 3 ( 1 ) " " " " " " "*"
## 4 ( 1 ) " " " " " " "*"
## 5 ( 1 ) "*" " " " " "*"
## 6 ( 1 ) "*" "*" " " "*"
## 7 ( 1 ) "*" "*" " " "*"
## 8 ( 1 ) "*" "*" " " "*"
## in_spotify_charts in_apple_playlists in_apple_charts in_deezer_charts
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " "*" " " " "
## 3 ( 1 ) "*" "*" " " " "
## 4 ( 1 ) "*" "*" " " " "
## 5 ( 1 ) "*" "*" " " " "
## 6 ( 1 ) "*" "*" " " " "
## 7 ( 1 ) "*" "*" " " "*"
## 8 ( 1 ) "*" "*" " " "*"
## bpm keyA keyA# keyB keyC# keyD keyD# keyE keyF keyF# keyG keyG#
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " "
## modeMinor danceability valence energy acousticness instrumentalness
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " "*" " " " "
## 5 ( 1 ) " " " " " " "*" " " " "
## 6 ( 1 ) " " " " " " "*" " " " "
## 7 ( 1 ) " " " " " " "*" " " " "
## 8 ( 1 ) " " " " " " "*" " " " "
## liveness speechiness
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " " "
## 7 ( 1 ) " " " "
## 8 ( 1 ) " " " "
As we can see, the most significant features are the playlist counts across the various platforms. Energy also turns out to matter, as does the number of artists who worked on the song and, surprisingly, whether it is in the key of G!
As you can see, the regsubsets() function considers at most
8 predictors by default. To change this, we have to use the
nvmax parameter.
dataframe_bs <- regsubsets(streams ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe, nvmax = 19, really.big=T)
dataframe_bs_sum <- summary(dataframe_bs)
dataframe_bs_sum
## Subset selection object
## Call: regsubsets.formula(streams ~ . - track_name - artist.s._name -
## released_day - in_deezer_playlists - in_shazam_charts, data = dataframe,
## nvmax = 19, really.big = T)
## 28 Variables (and intercept)
## Forced in Forced out
## artist_count FALSE FALSE
## released_year FALSE FALSE
## released_month FALSE FALSE
## in_spotify_playlists FALSE FALSE
## in_spotify_charts FALSE FALSE
## in_apple_playlists FALSE FALSE
## in_apple_charts FALSE FALSE
## in_deezer_charts FALSE FALSE
## bpm FALSE FALSE
## keyA FALSE FALSE
## keyA# FALSE FALSE
## keyB FALSE FALSE
## keyC# FALSE FALSE
## keyD FALSE FALSE
## keyD# FALSE FALSE
## keyE FALSE FALSE
## keyF FALSE FALSE
## keyF# FALSE FALSE
## keyG FALSE FALSE
## keyG# FALSE FALSE
## modeMinor FALSE FALSE
## danceability FALSE FALSE
## valence FALSE FALSE
## energy FALSE FALSE
## acousticness FALSE FALSE
## instrumentalness FALSE FALSE
## liveness FALSE FALSE
## speechiness FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: exhaustive
## artist_count released_year released_month in_spotify_playlists
## 1 ( 1 ) " " " " " " "*"
## 2 ( 1 ) " " " " " " "*"
## 3 ( 1 ) " " " " " " "*"
## 4 ( 1 ) " " " " " " "*"
## 5 ( 1 ) "*" " " " " "*"
## 6 ( 1 ) "*" "*" " " "*"
## 7 ( 1 ) "*" "*" " " "*"
## 8 ( 1 ) "*" "*" " " "*"
## 9 ( 1 ) "*" "*" " " "*"
## 10 ( 1 ) "*" "*" " " "*"
## 11 ( 1 ) "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" " " "*"
## 13 ( 1 ) "*" "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*"
## in_spotify_charts in_apple_playlists in_apple_charts in_deezer_charts
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " "*" " " " "
## 3 ( 1 ) "*" "*" " " " "
## 4 ( 1 ) "*" "*" " " " "
## 5 ( 1 ) "*" "*" " " " "
## 6 ( 1 ) "*" "*" " " " "
## 7 ( 1 ) "*" "*" " " "*"
## 8 ( 1 ) "*" "*" " " "*"
## 9 ( 1 ) "*" "*" " " "*"
## 10 ( 1 ) "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*"
## bpm keyA keyA# keyB keyC# keyD keyD# keyE keyF keyF# keyG keyG#
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " "
## 9 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " "
## 10 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " "
## 11 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " "
## 12 ( 1 ) " " " " "*" " " " " " " " " "*" " " " " "*" " "
## 13 ( 1 ) " " " " "*" " " " " " " " " " " " " " " "*" " "
## 14 ( 1 ) " " " " "*" " " " " " " " " "*" " " " " "*" " "
## 15 ( 1 ) " " " " "*" " " " " " " " " "*" " " " " "*" " "
## 16 ( 1 ) " " " " "*" " " " " " " " " "*" " " "*" "*" " "
## 17 ( 1 ) " " " " "*" " " " " " " " " "*" " " "*" "*" " "
## 18 ( 1 ) " " " " "*" " " " " " " " " "*" " " "*" "*" " "
## 19 ( 1 ) " " " " "*" " " " " "*" " " "*" " " "*" "*" " "
## modeMinor danceability valence energy acousticness instrumentalness
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " "*" " " " "
## 5 ( 1 ) " " " " " " "*" " " " "
## 6 ( 1 ) " " " " " " "*" " " " "
## 7 ( 1 ) " " " " " " "*" " " " "
## 8 ( 1 ) " " " " " " "*" " " " "
## 9 ( 1 ) " " "*" " " "*" " " " "
## 10 ( 1 ) " " "*" " " "*" " " " "
## 11 ( 1 ) " " "*" " " "*" " " " "
## 12 ( 1 ) " " "*" " " "*" " " " "
## 13 ( 1 ) " " " " " " "*" "*" " "
## 14 ( 1 ) " " " " " " "*" "*" " "
## 15 ( 1 ) " " "*" " " "*" "*" " "
## 16 ( 1 ) " " "*" " " "*" "*" " "
## 17 ( 1 ) " " "*" " " "*" "*" "*"
## 18 ( 1 ) " " "*" "*" "*" "*" "*"
## 19 ( 1 ) " " "*" "*" "*" "*" "*"
## liveness speechiness
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " " "
## 7 ( 1 ) " " " "
## 8 ( 1 ) " " " "
## 9 ( 1 ) " " " "
## 10 ( 1 ) " " " "
## 11 ( 1 ) " " " "
## 12 ( 1 ) " " " "
## 13 ( 1 ) " " "*"
## 14 ( 1 ) " " "*"
## 15 ( 1 ) " " "*"
## 16 ( 1 ) " " "*"
## 17 ( 1 ) " " "*"
## 18 ( 1 ) " " "*"
## 19 ( 1 ) " " "*"
We can see that, for the most part, the key has no influence on the number of streams. Interestingly, whether the song was recorded live (liveness) is also insignificant, as is whether it sounds positive or negative (valence).
The object returned by summary.regsubsets()
contains information that makes it possible to identify the globally best
subset of features, e.g. the \(C_p\) statistic.
dataframe_bs_sum$cp
## [1] 381.410414 62.707579 48.272729 32.812311 24.061786 14.880204
## [7] 8.850438 7.597920 6.910704 6.168598 6.089836 6.078463
## [13] 6.000602 6.205914 6.745193 8.015665 9.375488 10.881186
## [19] 12.397510
The best subset according to the BIC criterion
bic_min <- which.min(dataframe_bs_sum$bic)
bic_min
## [1] 7
dataframe_bs_sum$bic[bic_min]
## [1] -1200.961
The corresponding plot
{
plot(dataframe_bs_sum$bic, xlab = "Number of variables", ylab = "BIC", col = "green",
type = "b", pch = 20)
points(bic_min, dataframe_bs_sum$bic[bic_min], col = "red", pch = 9)
}
It is best to keep around 7 features; adding more makes the model worse.
A special type of plot is also available
(?plot.regsubsets).
plot(dataframe_bs, scale = "bic")
Coefficient estimates for the selected subset (note that coef() below uses id = 6, while BIC points to 7 variables; use id = bic_min for the BIC-optimal model)
coef(dataframe_bs, id = 6)
## (Intercept) artist_count released_year
## -6.185492e+09 -3.673632e+07 3.226564e+06
## in_spotify_playlists in_spotify_charts in_apple_playlists
## 3.669117e+04 1.944172e+06 2.671990e+06
## energy
## -2.348066e+06
[Carry out a similar analysis for the other optimality criteria: \(C_p\) and adjusted \(R^2\). Note that adjusted \(R^2\) should be maximized.]
The best subset according to the \(C_p\) criterion
cp_min <- which.min(dataframe_bs_sum$cp)
cp_min
## [1] 13
dataframe_bs_sum$cp[cp_min]
## [1] 6.000602
The corresponding plot
{
plot(dataframe_bs_sum$cp, xlab = "Number of variables", ylab = "Cp", col = "green",
type = "b", pch = 20)
points(cp_min, dataframe_bs_sum$cp[cp_min], col = "red", pch = 9)
}
A special type of plot is also available
(?plot.regsubsets).
plot(dataframe_bs, scale = "Cp")
Coefficient estimates (again shown for id = 6; the \(C_p\)-optimal model has 13 variables, so id = cp_min would give the \(C_p\)-best subset)
coef(dataframe_bs, id = 6)
## (Intercept) artist_count released_year
## -6.185492e+09 -3.673632e+07 3.226564e+06
## in_spotify_playlists in_spotify_charts in_apple_playlists
## 3.669117e+04 1.944172e+06 2.671990e+06
## energy
## -2.348066e+06
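Completing the exercise above, adjusted \(R^2\) can be handled the same way. The sketch below reuses the dataframe_bs_sum object computed earlier; note which.max instead of which.min, since adjusted \(R^2\) is maximized.

```r
# Adjusted R^2 is maximized, unlike Cp and BIC, which are minimized
adjr2_max <- which.max(dataframe_bs_sum$adjr2)
adjr2_max
dataframe_bs_sum$adjr2[adjr2_max]
plot(dataframe_bs_sum$adjr2, xlab = "Number of variables", ylab = "Adjusted R^2",
     col = "green", type = "b", pch = 20)
points(adjr2_max, dataframe_bs_sum$adjr2[adjr2_max], col = "red", pch = 9)
plot(dataframe_bs, scale = "adjr2")
```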
With the method parameter set appropriately, the
regsubsets() function can perform stepwise selection.
dataframe_fwd <- regsubsets(streams ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe, nvmax = 19,
method = "forward")
dataframe_fwd_sum <- summary(dataframe_fwd)
dataframe_fwd_sum
## Subset selection object
## Call: regsubsets.formula(streams ~ . - track_name - artist.s._name -
## released_day - in_deezer_playlists - in_shazam_charts, data = dataframe,
## nvmax = 19, method = "forward")
## 28 Variables (and intercept)
## Forced in Forced out
## artist_count FALSE FALSE
## released_year FALSE FALSE
## released_month FALSE FALSE
## in_spotify_playlists FALSE FALSE
## in_spotify_charts FALSE FALSE
## in_apple_playlists FALSE FALSE
## in_apple_charts FALSE FALSE
## in_deezer_charts FALSE FALSE
## bpm FALSE FALSE
## keyA FALSE FALSE
## keyA# FALSE FALSE
## keyB FALSE FALSE
## keyC# FALSE FALSE
## keyD FALSE FALSE
## keyD# FALSE FALSE
## keyE FALSE FALSE
## keyF FALSE FALSE
## keyF# FALSE FALSE
## keyG FALSE FALSE
## keyG# FALSE FALSE
## modeMinor FALSE FALSE
## danceability FALSE FALSE
## valence FALSE FALSE
## energy FALSE FALSE
## acousticness FALSE FALSE
## instrumentalness FALSE FALSE
## liveness FALSE FALSE
## speechiness FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: forward
## artist_count released_year released_month in_spotify_playlists
## 1 ( 1 ) " " " " " " "*"
## 2 ( 1 ) " " " " " " "*"
## 3 ( 1 ) " " " " " " "*"
## 4 ( 1 ) " " " " " " "*"
## 5 ( 1 ) "*" " " " " "*"
## 6 ( 1 ) "*" "*" " " "*"
## 7 ( 1 ) "*" "*" " " "*"
## 8 ( 1 ) "*" "*" " " "*"
## 9 ( 1 ) "*" "*" " " "*"
## 10 ( 1 ) "*" "*" " " "*"
## 11 ( 1 ) "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*"
## in_spotify_charts in_apple_playlists in_apple_charts in_deezer_charts
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " "*" " " " "
## 3 ( 1 ) "*" "*" " " " "
## 4 ( 1 ) "*" "*" " " " "
## 5 ( 1 ) "*" "*" " " " "
## 6 ( 1 ) "*" "*" " " " "
## 7 ( 1 ) "*" "*" " " "*"
## 8 ( 1 ) "*" "*" " " "*"
## 9 ( 1 ) "*" "*" " " "*"
## 10 ( 1 ) "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*"
## bpm keyA keyA# keyB keyC# keyD keyD# keyE keyF keyF# keyG keyG#
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " "
## 9 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " "
## 10 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " "
## 11 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " "
## 12 ( 1 ) " " " " "*" " " " " " " " " " " " " " " "*" " "
## 13 ( 1 ) " " " " "*" " " " " " " " " "*" " " " " "*" " "
## 14 ( 1 ) " " " " "*" " " " " " " " " "*" " " " " "*" " "
## 15 ( 1 ) " " " " "*" " " " " " " " " "*" " " " " "*" " "
## 16 ( 1 ) " " " " "*" " " " " " " " " "*" " " "*" "*" " "
## 17 ( 1 ) " " " " "*" " " " " " " " " "*" " " "*" "*" " "
## 18 ( 1 ) " " " " "*" " " " " " " " " "*" " " "*" "*" " "
## 19 ( 1 ) " " " " "*" " " " " "*" " " "*" " " "*" "*" " "
## modeMinor danceability valence energy acousticness instrumentalness
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " "*" " " " "
## 5 ( 1 ) " " " " " " "*" " " " "
## 6 ( 1 ) " " " " " " "*" " " " "
## 7 ( 1 ) " " " " " " "*" " " " "
## 8 ( 1 ) " " " " " " "*" " " " "
## 9 ( 1 ) " " "*" " " "*" " " " "
## 10 ( 1 ) " " "*" " " "*" " " " "
## 11 ( 1 ) " " "*" " " "*" " " " "
## 12 ( 1 ) " " "*" " " "*" " " " "
## 13 ( 1 ) " " "*" " " "*" " " " "
## 14 ( 1 ) " " "*" " " "*" "*" " "
## 15 ( 1 ) " " "*" " " "*" "*" " "
## 16 ( 1 ) " " "*" " " "*" "*" " "
## 17 ( 1 ) " " "*" " " "*" "*" "*"
## 18 ( 1 ) " " "*" "*" "*" "*" "*"
## 19 ( 1 ) " " "*" "*" "*" "*" "*"
## liveness speechiness
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " " "
## 7 ( 1 ) " " " "
## 8 ( 1 ) " " " "
## 9 ( 1 ) " " " "
## 10 ( 1 ) " " " "
## 11 ( 1 ) " " " "
## 12 ( 1 ) " " " "
## 13 ( 1 ) " " " "
## 14 ( 1 ) " " " "
## 15 ( 1 ) " " "*"
## 16 ( 1 ) " " "*"
## 17 ( 1 ) " " "*"
## 18 ( 1 ) " " "*"
## 19 ( 1 ) " " "*"
dataframe_back <- regsubsets(streams ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe, nvmax = 19,
method = "backward")
dataframe_back_sum <- summary(dataframe_back)
dataframe_back_sum
## Subset selection object
## Call: regsubsets.formula(streams ~ . - track_name - artist.s._name -
## released_day - in_deezer_playlists - in_shazam_charts, data = dataframe,
## nvmax = 19, method = "backward")
## 28 Variables (and intercept)
## Forced in Forced out
## artist_count FALSE FALSE
## released_year FALSE FALSE
## released_month FALSE FALSE
## in_spotify_playlists FALSE FALSE
## in_spotify_charts FALSE FALSE
## in_apple_playlists FALSE FALSE
## in_apple_charts FALSE FALSE
## in_deezer_charts FALSE FALSE
## bpm FALSE FALSE
## keyA FALSE FALSE
## keyA# FALSE FALSE
## keyB FALSE FALSE
## keyC# FALSE FALSE
## keyD FALSE FALSE
## keyD# FALSE FALSE
## keyE FALSE FALSE
## keyF FALSE FALSE
## keyF# FALSE FALSE
## keyG FALSE FALSE
## keyG# FALSE FALSE
## modeMinor FALSE FALSE
## danceability FALSE FALSE
## valence FALSE FALSE
## energy FALSE FALSE
## acousticness FALSE FALSE
## instrumentalness FALSE FALSE
## liveness FALSE FALSE
## speechiness FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: backward
## artist_count released_year released_month in_spotify_playlists
## 1 ( 1 ) " " " " " " "*"
## 2 ( 1 ) " " " " " " "*"
## 3 ( 1 ) " " " " " " "*"
## 4 ( 1 ) " " " " " " "*"
## 5 ( 1 ) "*" " " " " "*"
## 6 ( 1 ) "*" "*" " " "*"
## 7 ( 1 ) "*" "*" " " "*"
## 8 ( 1 ) "*" "*" " " "*"
## 9 ( 1 ) "*" "*" " " "*"
## 10 ( 1 ) "*" "*" " " "*"
## 11 ( 1 ) "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*"
## in_spotify_charts in_apple_playlists in_apple_charts in_deezer_charts
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " "*" " " " "
## 3 ( 1 ) "*" "*" " " " "
## 4 ( 1 ) "*" "*" " " " "
## 5 ( 1 ) "*" "*" " " " "
## 6 ( 1 ) "*" "*" " " " "
## 7 ( 1 ) "*" "*" " " "*"
## 8 ( 1 ) "*" "*" " " "*"
## 9 ( 1 ) "*" "*" " " "*"
## 10 ( 1 ) "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*"
## bpm keyA keyA# keyB keyC# keyD keyD# keyE keyF keyF# keyG keyG#
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " "
## 9 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " "
## 10 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " "
## 11 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " "
## 12 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " "
## 13 ( 1 ) " " " " "*" " " " " " " " " " " " " " " "*" " "
## 14 ( 1 ) " " " " "*" " " " " " " " " "*" " " " " "*" " "
## 15 ( 1 ) " " " " "*" " " " " " " " " "*" " " " " "*" " "
## 16 ( 1 ) " " " " "*" " " " " " " " " "*" " " " " "*" " "
## 17 ( 1 ) " " " " "*" " " " " " " "*" "*" " " " " "*" " "
## 18 ( 1 ) " " " " "*" "*" " " " " "*" "*" " " " " "*" " "
## 19 ( 1 ) " " " " "*" "*" " " " " "*" "*" "*" " " "*" " "
## modeMinor danceability valence energy acousticness instrumentalness
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " "*" " " " "
## 5 ( 1 ) " " " " " " "*" " " " "
## 6 ( 1 ) " " " " " " "*" " " " "
## 7 ( 1 ) " " " " " " "*" " " " "
## 8 ( 1 ) " " " " " " "*" " " " "
## 9 ( 1 ) " " " " " " "*" "*" " "
## 10 ( 1 ) " " " " " " "*" "*" " "
## 11 ( 1 ) " " " " " " "*" "*" " "
## 12 ( 1 ) " " " " " " "*" "*" " "
## 13 ( 1 ) " " " " " " "*" "*" " "
## 14 ( 1 ) " " " " " " "*" "*" " "
## 15 ( 1 ) " " "*" " " "*" "*" " "
## 16 ( 1 ) " " "*" " " "*" "*" "*"
## 17 ( 1 ) " " "*" " " "*" "*" "*"
## 18 ( 1 ) " " "*" " " "*" "*" "*"
## 19 ( 1 ) " " "*" " " "*" "*" "*"
## liveness speechiness
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " " "
## 7 ( 1 ) " " " "
## 8 ( 1 ) " " " "
## 9 ( 1 ) " " " "
## 10 ( 1 ) " " " "
## 11 ( 1 ) " " " "
## 12 ( 1 ) " " "*"
## 13 ( 1 ) " " "*"
## 14 ( 1 ) " " "*"
## 15 ( 1 ) " " "*"
## 16 ( 1 ) " " "*"
## 17 ( 1 ) " " "*"
## 18 ( 1 ) " " "*"
## 19 ( 1 ) " " "*"
The conclusions are the same as for the exhaustive (non-stepwise) selection.
Test-error estimates will be accurate only if every aspect of model fitting, including variable selection, is carried out using the training set alone.
n <- nrow(dataframe)
train <- sample(c(TRUE, FALSE), n, replace = TRUE)
test <- !train
dataframe_bs_v <- regsubsets(streams ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe[train,], nvmax = 19)
Unfortunately, there is no predict() method for the models
returned by regsubsets. One can be written as follows
(the model.matrix() function constructs the \(X\) matrix for the supplied data points).
predict.regsubsets <- function(object, newdata, id, ...) {
model_formula <- as.formula(object$call[[2]])
mat <- model.matrix(model_formula, newdata)
coefs <- coef(object, id = id)
mat[, names(coefs)] %*% coefs
}
We compute the error estimates
prediction_error <- function(i, model, subset) {
pred <- predict(model, dataframe[subset,], id = i)
mean((dataframe$streams[subset] - pred)^2)
}
val_errors <- sapply(1:19, prediction_error, model = dataframe_bs_v, subset = test)
val_errors
## [1] 1.580494e+17 1.092670e+17 1.085900e+17 1.068237e+17 1.075376e+17
## [6] 1.086365e+17 1.079801e+17 1.055658e+17 1.071522e+17 1.073144e+17
## [11] 1.070331e+17 1.092011e+17 1.088255e+17 1.094225e+17 1.095242e+17
## [16] 1.093596e+17 1.100975e+17 1.093578e+17 1.106267e+17
The validation errors point to a model with 8 features (the minimum of val_errors)
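A sketch of the usual follow-up step, using the val_errors vector and dataframe defined above: identify the model size with the smallest validation error, then refit best-subset selection on the full data set and report the coefficients of the model of that size.

```r
# Pick the model size with the smallest validation error...
best_size <- which.min(val_errors)
best_size
# ...and refit subset selection on ALL the data; the coefficients of the
# best model of that size are the final estimates
fit_full <- regsubsets(streams ~ . - track_name - artist.s._name - released_day -
                         in_deezer_playlists - in_shazam_charts,
                       data = dataframe, nvmax = 19)
coef(fit_full, id = best_size)
```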
dataframe = read.csv("spotify-2023.csv")
names(dataframe)[names(dataframe) == "danceability_."] <- "danceability"
names(dataframe)[names(dataframe) == "valence_."] <- "valence"
names(dataframe)[names(dataframe) == "energy_."] <- "energy"
names(dataframe)[names(dataframe) == "acousticness_."] <- "acousticness"
names(dataframe)[names(dataframe) == "instrumentalness_."] <- "instrumentalness"
names(dataframe)[names(dataframe) == "liveness_."] <- "liveness"
names(dataframe)[names(dataframe) == "speechiness_."] <- "speechiness"
dataframe = transform(dataframe, streams = as.numeric(streams))
## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion
head(dataframe)
## track_name artist.s._name artist_count
## 1 Seven (feat. Latto) (Explicit Ver.) Latto, Jung Kook 2
## 2 LALA Myke Towers 1
## 3 vampire Olivia Rodrigo 1
## 4 Cruel Summer Taylor Swift 1
## 5 WHERE SHE GOES Bad Bunny 1
## 6 Sprinter Dave, Central Cee 2
## released_year released_month released_day in_spotify_playlists
## 1 2023 7 14 553
## 2 2023 3 23 1474
## 3 2023 6 30 1397
## 4 2019 8 23 7858
## 5 2023 5 18 3133
## 6 2023 6 1 2186
## in_spotify_charts streams in_apple_playlists in_apple_charts
## 1 147 141381703 43 263
## 2 48 133716286 48 126
## 3 113 140003974 94 207
## 4 100 800840817 116 207
## 5 50 303236322 84 133
## 6 91 183706234 67 213
## in_deezer_playlists in_deezer_charts in_shazam_charts bpm key mode
## 1 45 10 826 125 B Major
## 2 58 14 382 92 C# Major
## 3 91 14 949 138 F Major
## 4 125 12 548 170 A Major
## 5 87 15 425 144 A Minor
## 6 88 17 946 141 C# Major
## danceability valence energy acousticness instrumentalness liveness
## 1 80 89 83 31 0 8
## 2 71 61 74 7 0 10
## 3 51 32 53 17 0 31
## 4 55 58 72 11 0 11
## 5 65 23 80 14 63 11
## 6 92 66 58 19 0 8
## speechiness
## 1 4
## 2 4
## 3 6
## 4 15
## 5 6
## 6 24
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
dataframe <- na.omit(dataframe)
Both regularization methods discussed in the lecture are implemented in the
glmnet() function from the glmnet package.
The glmnet::glmnet() function has a syntax different from
lm() and its relatives. The input data must be supplied
differently; in particular, the matrix \(\mathbf{X}\)
has to be constructed by hand.
X <- model.matrix(streams ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe)[, -1]
y <- dataframe$streams
The alpha argument of glmnet() determines the
type of regularization used: 0 means ridge regression,
and 1 the lasso.
We perform ridge regression for explicitly specified values of \(\lambda\). The supplied sequence of \(\lambda\) values should be decreasing. The
glmnet() function standardizes the variables by default.
lambda_grid <- 10^seq(10, -2, length.out = 100)
fit_ridge <- glmnet(X, y, alpha = 0, lambda = lambda_grid)
For each value of \(\lambda\) we obtain a set of coefficient estimates, available as a matrix
dim(coef(fit_ridge))
## [1] 29 100
One can verify that larger values of \(\lambda\) give a smaller Euclidean norm of the coefficients (the intercept is excluded).
fit_ridge$lambda[50]
## [1] 11497.57
coef_ridge <- coef(fit_ridge)[, 50]
coef_ridge
## (Intercept) artist_count released_year
## -6.610929e+09 -3.454073e+07 3.416056e+06
## released_month in_spotify_playlists in_spotify_charts
## 3.842087e+06 3.565754e+04 3.545401e+06
## in_apple_playlists in_apple_charts in_deezer_charts
## 2.929355e+06 -4.342749e+05 -6.295116e+06
## bpm keyA keyA#
## -5.486203e+04 2.151831e+06 6.729063e+07
## keyB keyC# keyD
## 2.289709e+07 5.624426e+05 -1.327940e+07
## keyD# keyE keyF
## 3.669045e+07 5.070821e+07 2.182324e+07
## keyF# keyG keyG#
## -2.444262e+07 -4.475443e+07 6.016687e+06
## modeMinor danceability valence
## 6.418007e+06 -6.339861e+05 -3.622206e+05
## energy acousticness instrumentalness
## -1.236859e+06 7.331429e+05 -1.016489e+06
## liveness speechiness
## 3.975903e+05 -1.312226e+06
sqrt(sum(coef_ridge[-1]^2))
## [1] 116581007
Smaller values of \(\lambda\), on the other hand, give a larger Euclidean norm of the coefficients
fit_ridge$lambda[70]
## [1] 43.28761
coef(fit_ridge)[, 70]
## (Intercept) artist_count released_year
## -6.611587e+09 -3.454144e+07 3.416377e+06
## released_month in_spotify_playlists in_spotify_charts
## 3.842226e+06 3.565787e+04 3.545692e+06
## in_apple_playlists in_apple_charts in_deezer_charts
## 2.929441e+06 -4.343660e+05 -6.296086e+06
## bpm keyA keyA#
## -5.486096e+04 2.154622e+06 6.729348e+07
## keyB keyC# keyD
## 2.289887e+07 5.621571e+05 -1.327873e+07
## keyD# keyE keyF
## 3.669365e+07 5.071000e+07 2.182568e+07
## keyF# keyG keyG#
## -2.444270e+07 -4.475287e+07 6.018289e+06
## modeMinor danceability valence
## 6.418594e+06 -6.339685e+05 -3.622077e+05
## energy acousticness instrumentalness
## -1.236851e+06 7.331943e+05 -1.016450e+06
## liveness speechiness
## 3.976559e+05 -1.312262e+06
sqrt(sum(coef(fit_ridge)[-1, 70]^2))
## [1] 116585042
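The whole shrinkage pattern can be inspected at once with glmnet's built-in coefficient-path plot, using the fit_ridge object from above.

```r
# Each curve traces one coefficient as a function of log(lambda);
# all coefficients shrink towards zero as lambda grows
plot(fit_ridge, xvar = "lambda")
```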
Using the predict.glmnet() function one can obtain, for example,
coefficient estimates for a new value of \(\lambda\) (e.g. 50)
predict(fit_ridge, s = 50, type = "coefficients")
## 29 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -6.611587e+09
## artist_count -3.454144e+07
## released_year 3.416377e+06
## released_month 3.842226e+06
## in_spotify_playlists 3.565787e+04
## in_spotify_charts 3.545692e+06
## in_apple_playlists 2.929440e+06
## in_apple_charts -4.343659e+05
## in_deezer_charts -6.296085e+06
## bpm -5.486096e+04
## keyA 2.154620e+06
## keyA# 6.729348e+07
## keyB 2.289886e+07
## keyC# 5.621573e+05
## keyD -1.327873e+07
## keyD# 3.669365e+07
## keyE 5.071000e+07
## keyF 2.182568e+07
## keyF# -2.444270e+07
## keyG -4.475287e+07
## keyG# 6.018288e+06
## modeMinor 6.418593e+06
## danceability -6.339685e+05
## valence -3.622077e+05
## energy -1.236851e+06
## acousticness 7.331943e+05
## instrumentalness -1.016450e+06
## liveness 3.976558e+05
## speechiness -1.312262e+06
We estimate the test MSE
set.seed(1)
n <- nrow(X)
train <- sample(n, n / 2)
test <- -train
fit_ridge <- glmnet(X[train,], y[train], alpha = 0, lambda = lambda_grid,
thresh = 1e-12)
For \(\lambda = 4\)
pred_ridge <- predict(fit_ridge, s = 4, newx = X[test,])
mean((pred_ridge - y[test])^2)
## [1] 7.991473e+16
Test MSE for the null model (intercept only)
pred_null <- mean(y[train])
mean((pred_null - y[test])^2)
## [1] 2.800065e+17
Test MSE for a very large value of \(\lambda = 10^{10}\)
pred_ridge_big <- predict(fit_ridge, s = 1e10, newx = X[test,])
mean((pred_ridge_big - y[test])^2)
## [1] 2.363601e+17
[How do they compare?] Ridge with \(\lambda = 4\) (MSE ≈ 7.99 × 10^16) is far better than both the null model (≈ 2.80 × 10^17) and the heavily penalized fit with \(\lambda = 10^{10}\) (≈ 2.36 × 10^17), which shrinks the coefficients so strongly that it behaves almost like the null model.
Test MSE for \(\lambda = 0\) (ordinary least squares)
pred_ridge_0 <- predict(fit_ridge, x = X[train,], y = y[train], s = 0,
newx = X[test,], exact = TRUE)
mean((pred_ridge_0 - y[test])^2)
## [1] 7.991473e+16
Comparison of the coefficient estimates
lm(y ~ X, subset = train)
##
## Call:
## lm(formula = y ~ X, subset = train)
##
## Coefficients:
## (Intercept) Xartist_count Xreleased_year
## -7.701e+09 -3.791e+07 3.965e+06
## Xreleased_month Xin_spotify_playlists Xin_spotify_charts
## 3.872e+05 3.557e+04 3.706e+06
## Xin_apple_playlists Xin_apple_charts Xin_deezer_charts
## 2.931e+06 -2.831e+05 -6.724e+06
## Xbpm XkeyA XkeyA#
## -3.769e+05 -1.865e+07 8.209e+07
## XkeyB XkeyC# XkeyD
## 2.711e+06 1.401e+07 -1.708e+07
## XkeyD# XkeyE XkeyF
## 6.837e+07 1.088e+08 -2.711e+06
## XkeyF# XkeyG XkeyG#
## 8.864e+06 5.909e+06 4.258e+07
## XmodeMinor Xdanceability Xvalence
## 4.315e+07 8.028e+05 -1.345e+06
## Xenergy Xacousticness Xinstrumentalness
## -1.701e+06 1.159e+06 -1.953e+06
## Xliveness Xspeechiness
## -3.285e+05 -1.221e+06
predict(fit_ridge, x = X[train,], y = y[train], s = 0, exact = TRUE,
type = "coefficients")[1:20,]
## (Intercept) artist_count released_year
## -7.701100e+09 -3.791275e+07 3.964665e+06
## released_month in_spotify_playlists in_spotify_charts
## 3.871685e+05 3.556997e+04 3.706003e+06
## in_apple_playlists in_apple_charts in_deezer_charts
## 2.930568e+06 -2.831441e+05 -6.724399e+06
## bpm keyA keyA#
## -3.769107e+05 -1.864849e+07 8.208962e+07
## keyB keyC# keyD
## 2.711360e+06 1.400837e+07 -1.708006e+07
## keyD# keyE keyF
## 6.837425e+07 1.088185e+08 -2.711139e+06
## keyF# keyG
## 8.863846e+06 5.909349e+06
Computing the optimal value of \(\lambda\) using cross-validation
set.seed(1)
cv_out <- cv.glmnet(X[train,], y[train], alpha = 0)
plot(cv_out)
cv_out$lambda.min
## [1] 47707343
MSE for the optimal \(\lambda\)
pred_ridge_opt <- predict(fit_ridge, s = cv_out$lambda.min, newx = X[test,])
mean((pred_ridge_opt - y[test])^2)
## [1] 7.984013e+16
Coefficient estimates for the optimal \(\lambda\)
fit_ridge_full <- glmnet(X, y, alpha = 0)
predict(fit_ridge_full, s = cv_out$lambda.min, type = "coefficients")
## 29 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -4.052075e+09
## artist_count -3.261450e+07
## released_year 2.164381e+06
## released_month 3.142019e+06
## in_spotify_playlists 3.315303e+04
## in_spotify_charts 2.824766e+06
## in_apple_playlists 2.725871e+06
## in_apple_charts -1.499768e+05
## in_deezer_charts -3.733507e+06
## bpm -7.593654e+04
## keyA -4.495307e+06
## keyA# 6.159003e+07
## keyB 1.948485e+07
## keyC# 4.867638e+06
## keyD -1.279375e+07
## keyD# 3.134895e+07
## keyE 4.724777e+07
## keyF 1.675519e+07
## keyF# -2.081412e+07
## keyG -4.632411e+07
## keyG# 3.476638e+06
## modeMinor 4.111337e+06
## danceability -7.075421e+05
## valence -4.306323e+05
## energy -1.187928e+06
## acousticness 5.700174e+05
## instrumentalness -1.112572e+06
## liveness 1.838621e+05
## speechiness -1.174002e+06
Regularization did not help here: the test MSE at the optimal \(\lambda\) (≈ 7.98 × 10^16) is essentially the same as for least squares (≈ 7.99 × 10^16).
We fit the lasso over a grid of regularization parameters (here left to glmnet's default \(\lambda\) sequence)
fit_lasso <- glmnet(X[train,], y[train], alpha = 1)
plot(fit_lasso, xvar = "lambda")
We run cross-validation and compute the MSE estimate
cv_out <- cv.glmnet(X[train,], y[train], alpha = 1)
plot(cv_out)
cv_out$lambda.min
## [1] 15262853
pred_lasso <- predict(fit_lasso, s = cv_out$lambda.min, newx = X[test,])
mean((pred_lasso - y[test])^2)
## [1] 7.652867e+16
[How does this compare with the null model, least squares, and ridge regression?] The lasso achieves the lowest test MSE (≈ 7.65 × 10^16), slightly better than ridge with the optimal \(\lambda\) (≈ 7.98 × 10^16) and least squares (≈ 7.99 × 10^16), and far better than the null model (≈ 2.80 × 10^17).
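A sketch gathering the test-MSE estimates computed above into one sorted vector, reusing X, y, train, test, fit_ridge, and pred_lasso from the chunks above. The ridge \(\lambda\) is hard-coded as 47707343, the cv.glmnet value reported in the ridge section, since cv_out has since been overwritten by the lasso run.

```r
# All four test-MSE estimates side by side, smallest first
mse_compare <- c(
  null  = mean((mean(y[train]) - y[test])^2),
  ols   = mean((predict(fit_ridge, x = X[train,], y = y[train], s = 0,
                        newx = X[test,], exact = TRUE) - y[test])^2),
  ridge = mean((predict(fit_ridge, s = 47707343, newx = X[test,]) - y[test])^2),
  lasso = mean((pred_lasso - y[test])^2)
)
sort(mse_compare)
```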
Coefficient estimates for the optimal \(\lambda\)
fit_lasso_full <- glmnet(X, y, alpha = 1)
predict(fit_lasso_full, s = cv_out$lambda.min, type = "coefficients")[1:20,]
## (Intercept) artist_count released_year
## -1.338756e+09 -2.156827e+07 7.895307e+05
## released_month in_spotify_playlists in_spotify_charts
## 0.000000e+00 3.394806e+04 1.415269e+06
## in_apple_playlists in_apple_charts in_deezer_charts
## 2.639222e+06 0.000000e+00 0.000000e+00
## bpm keyA keyA#
## 0.000000e+00 0.000000e+00 1.192251e+06
## keyB keyC# keyD
## 0.000000e+00 0.000000e+00 0.000000e+00
## keyD# keyE keyF
## 0.000000e+00 7.476343e+05 0.000000e+00
## keyF# keyG
## 0.000000e+00 -1.254109e+07
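Unlike ridge, the lasso zeroes out many coefficients, i.e. it performs variable selection. A quick count of the predictors it keeps, using fit_lasso_full and cv_out defined above:

```r
# predict() returns a sparse one-column matrix of coefficients
lasso_coefs <- predict(fit_lasso_full, s = cv_out$lambda.min, type = "coefficients")
# Number of selected predictors, excluding the intercept
sum(lasso_coefs != 0) - 1
```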
dataframe = read.csv("spotify-2023.csv")
names(dataframe)[names(dataframe) == "danceability_."] <- "danceability"
names(dataframe)[names(dataframe) == "valence_."] <- "valence"
names(dataframe)[names(dataframe) == "energy_."] <- "energy"
names(dataframe)[names(dataframe) == "acousticness_."] <- "acousticness"
names(dataframe)[names(dataframe) == "instrumentalness_."] <- "instrumentalness"
names(dataframe)[names(dataframe) == "liveness_."] <- "liveness"
names(dataframe)[names(dataframe) == "speechiness_."] <- "speechiness"
dataframe = transform(dataframe, streams = as.numeric(streams))
## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion
dataframe = na.omit(dataframe)
head(dataframe)
## track_name artist.s._name artist_count
## 1 Seven (feat. Latto) (Explicit Ver.) Latto, Jung Kook 2
## 2 LALA Myke Towers 1
## 3 vampire Olivia Rodrigo 1
## 4 Cruel Summer Taylor Swift 1
## 5 WHERE SHE GOES Bad Bunny 1
## 6 Sprinter Dave, Central Cee 2
## released_year released_month released_day in_spotify_playlists
## 1 2023 7 14 553
## 2 2023 3 23 1474
## 3 2023 6 30 1397
## 4 2019 8 23 7858
## 5 2023 5 18 3133
## 6 2023 6 1 2186
## in_spotify_charts streams in_apple_playlists in_apple_charts
## 1 147 141381703 43 263
## 2 48 133716286 48 126
## 3 113 140003974 94 207
## 4 100 800840817 116 207
## 5 50 303236322 84 133
## 6 91 183706234 67 213
## in_deezer_playlists in_deezer_charts in_shazam_charts bpm key mode
## 1 45 10 826 125 B Major
## 2 58 14 382 92 C# Major
## 3 91 14 949 138 F Major
## 4 125 12 548 170 A Major
## 5 87 15 425 144 A Minor
## 6 88 17 946 141 C# Major
## danceability valence energy acousticness instrumentalness liveness
## 1 80 89 83 31 0 8
## 2 71 61 74 7 0 10
## 3 51 32 53 17 0 31
## 4 55 58 72 11 0 11
## 5 65 23 80 14 63 11
## 6 92 66 58 19 0 8
## speechiness
## 1 4
## 2 4
## 3 6
## 4 15
## 5 6
## 6 24
library(splines)
library(gam)
## Loading required package: foreach
## Loaded gam 1.22-3
fit_poly <- lm(streams ~ poly(energy, 4), data = dataframe)
summary(fit_poly)
##
## Call:
## lm(formula = streams ~ poly(energy, 4), data = dataframe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -525902947 -372594811 -224310311 152574110 3198091315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 514137425 18385120 27.965 <2e-16 ***
## poly(energy, 4)1 -455403405 567263618 -0.803 0.422
## poly(energy, 4)2 -494261264 567263618 -0.871 0.384
## poly(energy, 4)3 345346985 567263618 0.609 0.543
## poly(energy, 4)4 -526866803 567263618 -0.929 0.353
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 567300000 on 947 degrees of freedom
## Multiple R-squared: 0.002777, Adjusted R-squared: -0.001435
## F-statistic: 0.6592 on 4 and 947 DF, p-value: 0.6204
The same fit, now using the standard polynomial basis \(X, X^2, X^3, X^4\).
fit_poly_raw <- lm(streams ~ poly(energy, 4, raw = TRUE), data = dataframe)
summary(fit_poly_raw)
##
## Call:
## lm(formula = streams ~ poly(energy, 4, raw = TRUE), data = dataframe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -525902947 -372594811 -224310311 152574110 3198091315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.413e+08 7.947e+08 -0.555 0.579
## poly(energy, 4, raw = TRUE)1 7.654e+07 6.652e+07 1.150 0.250
## poly(energy, 4, raw = TRUE)2 -2.086e+06 1.962e+06 -1.063 0.288
## poly(energy, 4, raw = TRUE)3 2.388e+04 2.422e+04 0.986 0.325
## poly(energy, 4, raw = TRUE)4 -9.893e+01 1.065e+02 -0.929 0.353
##
## Residual standard error: 567300000 on 947 degrees of freedom
## Multiple R-squared: 0.002777, Adjusted R-squared: -0.001435
## F-statistic: 0.6592 on 4 and 947 DF, p-value: 0.6204
The same model as above, written differently.
fit_poly_raw <- lm(streams ~ energy + I(energy^2) + I(energy^3) + I(energy^4), data = dataframe)
summary(fit_poly_raw)
##
## Call:
## lm(formula = streams ~ energy + I(energy^2) + I(energy^3) + I(energy^4),
## data = dataframe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -525902947 -372594811 -224310311 152574110 3198091315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.413e+08 7.947e+08 -0.555 0.579
## energy 7.654e+07 6.652e+07 1.150 0.250
## I(energy^2) -2.086e+06 1.962e+06 -1.063 0.288
## I(energy^3) 2.388e+04 2.422e+04 0.986 0.325
## I(energy^4) -9.893e+01 1.065e+02 -0.929 0.353
##
## Residual standard error: 567300000 on 947 degrees of freedom
## Multiple R-squared: 0.002777, Adjusted R-squared: -0.001435
## F-statistic: 0.6592 on 4 and 947 DF, p-value: 0.6204
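The three parameterizations above span the same space of degree-4 polynomials, so their fitted values should agree up to numerical precision. A quick sanity check, assuming the objects fit_poly and fit_poly_raw fitted above:

```r
# Orthogonal and raw polynomial bases span the same column space, so the
# fitted values of both models are (numerically) identical.
all.equal(unname(fitted(fit_poly)), unname(fitted(fit_poly_raw)))
```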
A plot of the fit, including standard-error bands.
energy_lims <- range(dataframe$energy)
energy_grid <- seq(energy_lims[1], energy_lims[2])
pred_poly <- predict(fit_poly, list(energy = energy_grid), se.fit = TRUE)
se_bands <- cbind(pred_poly$fit + 2 * pred_poly$se.fit,
pred_poly$fit - 2 * pred_poly$se.fit)
{
plot(dataframe$energy, dataframe$streams, col = "darkgrey", cex = 0.5, xlim = energy_lims)
lines(energy_grid, pred_poly$fit, col = "red", lwd = 2)
matlines(energy_grid, se_bands, col = "red", lty = "dashed")
}
Once again we face the fact that the number of streams is hard to predict.
We want to build a classifier with two classes: popular songs (more than 10^9 streams) and less popular ones.
fit_log_poly <- glm(I(streams > 1000000000) ~ poly(energy, 4), data = dataframe, family = binomial)
By default, the predict.glm() function returns log-odds, which is
convenient for visualizing the standard error. We must, however,
transform the resulting values with the logistic function.
pred_log_poly <- predict(fit_log_poly, list(energy = energy_grid), se.fit = TRUE)
pred_probs <- plogis(pred_log_poly$fit)
se_bands_logit <- cbind(pred_log_poly$fit + 2 * pred_log_poly$se.fit,
pred_log_poly$fit - 2 * pred_log_poly$se.fit)
se_bands <- plogis(se_bands_logit)
plot(dataframe$energy, I(dataframe$streams > 1000000000), xlim = energy_lims, ylim = c(0, 1),
     col = "darkgrey", cex = 0.5, ylab = "P(streams > 10^9 | energy)")
lines(energy_grid, pred_probs, col = "red", lwd = 2)
matlines(energy_grid, se_bands, lty = "dashed", col = "red")
Songs with either relatively low energy (around 30) or relatively high energy (around 85) have the greatest chance of being popular.
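When reading these probabilities it helps to know how rare the positive class actually is; a quick look at the base rate, assuming the cleaned dataframe from above:

```r
# Proportion of "popular" songs, i.e. those with more than 10^9 streams.
# The classes are strongly imbalanced, so predicted probabilities stay low.
mean(dataframe$streams > 1000000000)
```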
A step-function fit is performed with the cut() function, which
converts a numeric variable into an ordered factor.
table(cut(dataframe$energy, breaks = 4))
##
## (8.91,31] (31,53] (53,75] (75,97.1]
## 35 206 438 273
The fit itself is done by lm().
fit_step <- lm(streams ~ cut(energy, 4), data = dataframe)
pred_step <- predict(fit_step, list(energy = energy_grid), se.fit = TRUE)
se_bands <- cbind(pred_step$fit + 2 * pred_step$se.fit,
pred_step$fit - 2 * pred_step$se.fit)
plot(dataframe$energy, dataframe$streams, col = "darkgrey", cex = 0.5, xlim = energy_lims)
lines(energy_grid, pred_step$fit, col = "red", lwd = 2)
matlines(energy_grid, se_bands, col = "red", lty = "dashed")
The basis of regression splines is computed by the bs()
function from the splines package. The default spline
degree is 3.
Regression using splines with fixed knots.
fit_bs_knots <- lm(streams ~ bs(energy, knots = c(25, 40, 60)), data = dataframe)
pred_bs_knots <- predict(fit_bs_knots, list(energy = energy_grid), se.fit = TRUE)
plot(dataframe$energy, dataframe$streams, cex = 0.5, col = "darkgrey")
lines(energy_grid, pred_bs_knots$fit, col = "red", lwd = 2)
lines(energy_grid, pred_bs_knots$fit + 2 * pred_bs_knots$se.fit, col = "red",
lty = "dashed")
lines(energy_grid, pred_bs_knots$fit - 2 * pred_bs_knots$se.fit, col = "red",
lty = "dashed")
abline(v = c(25, 40, 60), lty = "dotted")
[Check how the placement of the knots affects the fitted model.]
Fitting a model using splines with a fixed number of degrees of freedom. The knots are placed automatically.
fit_bs_dataframe <- lm(streams ~ bs(energy, df = 6), data = dataframe)
pred_bs_dataframe <- predict(fit_bs_dataframe, list(energy = energy_grid), se.fit = TRUE)
plot(dataframe$energy, dataframe$streams, cex = 0.5, col = "darkgrey")
lines(energy_grid, pred_bs_dataframe$fit, col = "red", lwd = 2)
lines(energy_grid, pred_bs_dataframe$fit + 2 * pred_bs_dataframe$se.fit, col = "red",
lty = "dashed")
lines(energy_grid, pred_bs_dataframe$fit - 2 * pred_bs_dataframe$se.fit, col = "red",
lty = "dashed")
bs_knots <- attr(bs(dataframe$energy, df = 6), "knots")
abline(v = bs_knots, lty = "dotted")
[Check how the number of degrees of freedom affects the fitted model.]
[The bs() function accepts a degree parameter
that sets the spline degree. Check what splines of other degrees
look like in the examples above.]
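A sketch of the exercise above, comparing B-spline fits of degrees 1, 2 and 3 with the same fixed knots (assuming dataframe and energy_grid from earlier; the colors are an arbitrary choice):

```r
# Fit B-splines of degree 1, 2 and 3 with the same fixed knots and overlay
# the fitted curves; lower degrees give visibly less smooth fits.
plot(dataframe$energy, dataframe$streams, cex = 0.5, col = "darkgrey")
for (d in 1:3) {
  fit_d <- lm(streams ~ bs(energy, knots = c(25, 40, 60), degree = d), data = dataframe)
  lines(energy_grid, predict(fit_d, list(energy = energy_grid)), col = d + 1, lwd = 2)
}
legend("topright", legend = paste("degree =", 1:3), col = 2:4, lty = 1, lwd = 2)
```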
The basis of natural cubic splines is computed by the
ns() function from the splines package.
fit_ns <- lm(streams ~ ns(energy, df = 4), data = dataframe)
pred_ns <- predict(fit_ns, list(energy = energy_grid), se.fit = TRUE)
plot(dataframe$energy, dataframe$streams, cex = 0.5, col = "darkgrey")
lines(energy_grid, pred_ns$fit, col = "red", lwd = 2)
lines(energy_grid, pred_ns$fit + 2 * pred_ns$se.fit, col = "red",
lty = "dashed")
lines(energy_grid, pred_ns$fit - 2 * pred_ns$se.fit, col = "red",
lty = "dashed")
abline(v = attr(ns(dataframe$energy, df = 4), "knots"), lty = "dotted")
A smoothing (cubic) spline is fitted to the data by the
smooth.spline() function. We can fit a smoothing
spline with a fixed number of degrees of freedom (here
16).
fit_smooth_dataframe <- smooth.spline(dataframe$energy, dataframe$streams, df = 16)
plot(dataframe$energy, dataframe$streams, cex = 0.5, col = "darkgrey")
lines(fit_smooth_dataframe, col = "red", lwd = 2)
The number of degrees of freedom can also be chosen automatically via cross-validation.
fit_smooth_cv <- smooth.spline(dataframe$energy, dataframe$streams, cv = TRUE)
plot(dataframe$energy, dataframe$streams, cex = 0.5, col = "darkgrey")
lines(fit_smooth_cv, col = "red", lwd = 2)
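The number of degrees of freedom chosen by cross-validation can be read off the fitted object, assuming fit_smooth_cv from above:

```r
# Effective degrees of freedom selected by cross-validation.
fit_smooth_cv$df
```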
Local regression (by default with degree-2 polynomials) is performed
by the loess() function. Its span parameter
corresponds to the method's parameter \(s\).
spans <- c(0.2, 0.5)
clrs <- c("red", "blue")
plot(dataframe$energy, dataframe$streams, cex = 0.5, col = "darkgrey")
for (i in 1:length(spans)) {
fit_loess <- loess(streams ~ energy, span = spans[i], data = dataframe)
pred_loess <- predict(fit_loess, data.frame(energy = energy_grid))
lines(energy_grid, pred_loess, col = clrs[i], lwd = 2)
}
legend("topright", legend = paste("s =", spans), col = clrs, lty = 1, lwd = 2)
The same with degree-1 polynomials.
spans <- c(0.2, 0.5)
clrs <- c("red", "blue")
plot(dataframe$energy, dataframe$streams, cex = 0.5, col = "darkgrey")
for (i in 1:length(spans)) {
fit_loess <- loess(streams ~ energy, span = spans[i], degree = 1, data = dataframe)
pred_loess <- predict(fit_loess, data.frame(energy = energy_grid))
lines(energy_grid, pred_loess, col = clrs[i], lwd = 2)
}
legend("topright", legend = paste("s =", spans), col = clrs, lty = 1, lwd = 2)
A GAM that is an extension of the linear model can be fitted by
least squares with the lm() function.
fit_gam_ls <- lm(streams ~ ns(liveness, df = 4) + ns(energy, df = 5) + speechiness,
data = dataframe)
fit_gam_ls
##
## Call:
## lm(formula = streams ~ ns(liveness, df = 4) + ns(energy, df = 5) +
## speechiness, data = dataframe)
##
## Coefficients:
## (Intercept) ns(liveness, df = 4)1 ns(liveness, df = 4)2
## 490613650 -185679764 18880355
## ns(liveness, df = 4)3 ns(liveness, df = 4)4 ns(energy, df = 5)1
## -309513355 -408063207 192513613
## ns(energy, df = 5)2 ns(energy, df = 5)3 ns(energy, df = 5)4
## 229884824 64003008 450311130
## ns(energy, df = 5)5 speechiness
## -36553235 -6864911
summary(fit_gam_ls)
##
## Call:
## lm(formula = streams ~ ns(liveness, df = 4) + ns(energy, df = 5) +
## speechiness, data = dataframe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -591343398 -363975051 -205534747 140592740 3155335490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 490613650 252430028 1.944 0.052246 .
## ns(liveness, df = 4)1 -185679764 128958908 -1.440 0.150246
## ns(liveness, df = 4)2 18880356 139889184 0.135 0.892667
## ns(liveness, df = 4)3 -309513355 292197500 -1.059 0.289753
## ns(liveness, df = 4)4 -408063207 232934448 -1.752 0.080128 .
## ns(energy, df = 5)1 192513614 197400036 0.975 0.329689
## ns(energy, df = 5)2 229884824 234761136 0.979 0.327719
## ns(energy, df = 5)3 64003008 143395251 0.446 0.655455
## ns(energy, df = 5)4 450311130 476840107 0.944 0.345226
## ns(energy, df = 5)5 -36553236 154269130 -0.237 0.812751
## speechiness -6864911 1856933 -3.697 0.000231 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 563900000 on 941 degrees of freedom
## Multiple R-squared: 0.02098, Adjusted R-squared: 0.01058
## F-statistic: 2.017 on 10 and 941 DF, p-value: 0.02889
More general GAMs are fitted with the backfitting algorithm
in the gam() function from the gam package.
The gam package also provides functions implementing
nonparametric components: s(), representing smoothing
splines, and lo(), representing local regression.
We fit a model similar to the previous one, but using smoothing splines.
fit_gam_bf <- gam(streams ~ s(liveness, df = 4) + s(energy, df = 5) + speechiness, data = dataframe)
summary(fit_gam_bf)
##
## Call: gam(formula = streams ~ s(liveness, df = 4) + s(energy, df = 5) +
## speechiness, data = dataframe)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -577432754 -364782396 -205612599 138098435 3157061286
##
## (Dispersion Parameter for gaussian family taken to be 3.176346e+17)
##
## Null Deviance: 3.055818e+20 on 951 degrees of freedom
## Residual Deviance: 2.988941e+20 on 940.9999 degrees of freedom
## AIC: 41079.89
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(liveness, df = 4) 1 6.8757e+17 6.8757e+17 2.1646 0.1415516
## s(energy, df = 5) 1 1.6909e+17 1.6909e+17 0.5324 0.4658003
## speechiness 1 4.3065e+18 4.3065e+18 13.5582 0.0002444 ***
## Residuals 941 2.9889e+20 3.1763e+17
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(liveness, df = 4) 3 0.93479 0.4232
## s(energy, df = 5) 4 0.78633 0.5341
## speechiness
A plot for the model fitted with gam().
par(mfrow = c(1, 3))
plot(fit_gam_bf, col = "red", se = TRUE)
The plot.Gam() function also works for least-squares
models, but then it has to be called
explicitly.
par(mfrow = c(1, 3))
plot.Gam(fit_gam_ls, col = "red", se = TRUE)
There is a version of the anova() function for comparing GAMs.
fit_gam_1 <- gam(streams ~ s(energy, df = 5) + speechiness, data = dataframe)
fit_gam_2 <- gam(streams ~ liveness + s(energy, df = 5) + speechiness, data = dataframe)
anova(fit_gam_1, fit_gam_2, fit_gam_bf, test = "F")
## Analysis of Deviance Table
##
## Model 1: streams ~ s(energy, df = 5) + speechiness
## Model 2: streams ~ liveness + s(energy, df = 5) + speechiness
## Model 3: streams ~ s(liveness, df = 4) + s(energy, df = 5) + speechiness
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 945 3.0046e+20
## 2 944 2.9978e+20 1.0000 6.7786e+17 2.1341 0.1444
## 3 941 2.9889e+20 3.0001 8.8608e+17 0.9298 0.4257
Fitting a model that uses local regression.
fit_gam_lo <- gam(streams ~ s(liveness, df = 4) + lo(energy, span = 0.7) + speechiness,
data = dataframe)
summary(fit_gam_lo)
##
## Call: gam(formula = streams ~ s(liveness, df = 4) + lo(energy, span = 0.7) +
## speechiness, data = dataframe)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -578148127 -365712192 -203669165 140780193 3161423534
##
## (Dispersion Parameter for gaussian family taken to be 3.173674e+17)
##
## Null Deviance: 3.055818e+20 on 951 degrees of freedom
## Residual Deviance: 2.995249e+20 on 943.7797 degrees of freedom
## AIC: 41076.34
##
## Number of Local Scoring Iterations: 1
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(liveness, df = 4) 1.00 6.9971e+17 6.9971e+17 2.2047 0.1379209
## lo(energy, span = 0.7) 1.00 1.6834e+17 1.6834e+17 0.5304 0.4666070
## speechiness 1.00 4.2103e+18 4.2103e+18 13.2663 0.0002849 ***
## Residuals 943.78 2.9952e+20 3.1737e+17
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(liveness, df = 4) 3.0 0.95347 0.4141
## lo(energy, span = 0.7) 1.2 0.94078 0.3501
## speechiness
par(mfrow = c(1, 3))
plot(fit_gam_lo, col = "green", se = TRUE)
Logistic regression using a GAM.
fit_logistic_gam <- gam(I(streams > 1000000000) ~ liveness + s(energy, df = 5) + speechiness,
family = binomial, data = dataframe)
summary(fit_logistic_gam)
##
## Call: gam(formula = I(streams > 1e+09) ~ liveness + s(energy, df = 5) +
## speechiness, family = binomial, data = dataframe)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7059 -0.6277 -0.5900 -0.4728 2.2727
##
## (Dispersion Parameter for binomial family taken to be 1)
##
## Null Deviance: 836.0694 on 951 degrees of freedom
## Residual Deviance: 825.9861 on 944 degrees of freedom
## AIC: 841.9862
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## liveness 1 0.74 0.7377 0.7345 0.39165
## s(energy, df = 5) 1 0.10 0.1032 0.1027 0.74865
## speechiness 1 6.11 6.1085 6.0820 0.01383 *
## Residuals 944 948.11 1.0044
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar Chisq P(Chi)
## (Intercept)
## liveness
## s(energy, df = 5) 4 2.1698 0.7046
## speechiness
par(mfrow = c(1, 3))
plot(fit_logistic_gam, col = "blue", se = TRUE)
The models performed poorly; a different dataset, or different prediction targets, should have been chosen :(
dataframe = read.csv("spotify-2023.csv")
names(dataframe)[names(dataframe) == "danceability_."] <- "danceability"
names(dataframe)[names(dataframe) == "valence_."] <- "valence"
names(dataframe)[names(dataframe) == "energy_."] <- "energy"
names(dataframe)[names(dataframe) == "acousticness_."] <- "acousticness"
names(dataframe)[names(dataframe) == "instrumentalness_."] <- "instrumentalness"
names(dataframe)[names(dataframe) == "liveness_."] <- "liveness"
names(dataframe)[names(dataframe) == "speechiness_."] <- "speechiness"
dataframe = transform(dataframe, liveness = as.numeric(liveness))
dataframe = na.omit(dataframe)
head(dataframe)
## track_name artist.s._name artist_count
## 1 Seven (feat. Latto) (Explicit Ver.) Latto, Jung Kook 2
## 2 LALA Myke Towers 1
## 3 vampire Olivia Rodrigo 1
## 4 Cruel Summer Taylor Swift 1
## 5 WHERE SHE GOES Bad Bunny 1
## 6 Sprinter Dave, Central Cee 2
## released_year released_month released_day in_spotify_playlists
## 1 2023 7 14 553
## 2 2023 3 23 1474
## 3 2023 6 30 1397
## 4 2019 8 23 7858
## 5 2023 5 18 3133
## 6 2023 6 1 2186
## in_spotify_charts streams in_apple_playlists in_apple_charts
## 1 147 141381703 43 263
## 2 48 133716286 48 126
## 3 113 140003974 94 207
## 4 100 800840817 116 207
## 5 50 303236322 84 133
## 6 91 183706234 67 213
## in_deezer_playlists in_deezer_charts in_shazam_charts bpm key mode
## 1 45 10 826 125 B Major
## 2 58 14 382 92 C# Major
## 3 91 14 949 138 F Major
## 4 125 12 548 170 A Major
## 5 87 15 425 144 A Minor
## 6 88 17 946 141 C# Major
## danceability valence energy acousticness instrumentalness liveness
## 1 80 89 83 31 0 8
## 2 71 61 74 7 0 10
## 3 51 32 53 17 0 31
## 4 55 58 72 11 0 11
## 5 65 23 80 14 63 11
## 6 92 66 58 19 0 8
## speechiness
## 1 4
## 2 4
## 3 6
## 4 15
## 5 6
## 6 24
library(tree)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
library(gbm)
## Loaded gbm 2.1.9
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
Decision trees are implemented in the tree package
(a somewhat different implementation is available in the
rpart package).
We will classify observations into two classes: performed live and not performed live. We augment the data set.
live <- factor(ifelse(dataframe$liveness <= 50, "No", "Yes"))
dataframeLive <- data.frame(dataframe, live)
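For comparison, a similar tree can be grown with the rpart package mentioned above; a minimal sketch, assuming dataframeLive built as above (the chosen predictors are just an illustration):

```r
library(rpart)
# Classification tree for the same target grown with rpart; its split
# criterion (Gini) and stopping rules differ slightly from tree().
fit_rpart <- rpart(live ~ energy + bpm + in_apple_playlists, data = dataframeLive,
                   method = "class")
```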
We build a classification tree predicting live from
the remaining variables (except liveness and a few
others).
liveness_high_tree <- tree(live ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts - liveness, data = dataframeLive)
## Warning in tree(live ~ . - track_name - artist.s._name - released_day - : NAs
## introduced by coercion
summary(liveness_high_tree)
##
## Classification tree:
## tree(formula = live ~ . - track_name - artist.s._name - released_day -
## in_deezer_playlists - in_shazam_charts - liveness, data = dataframeLive)
## Variables actually used in tree construction:
## [1] "energy" "in_spotify_charts" "bpm"
## [4] "valence" "in_apple_playlists" "streams"
## [7] "in_apple_charts" "danceability" "acousticness"
## [10] "artist_count" "released_year" "speechiness"
## [13] "released_month" "in_spotify_playlists"
## Number of terminal nodes: 28
## Residual mean deviance: 0.09891 = 91.49 / 925
## Misclassification error rate: 0.02623 = 25 / 953
Graphical representation of the fitted model.
{
plot(liveness_high_tree)
text(liveness_high_tree, pretty = 0)
}
We can see that it is much easier to classify a song as not performed live. Alternatively, there may simply be few songs with high liveness (>50). Either way, the tree found a narrow range of the remaining parameters that allows such songs to be detected.
More information is printed by the print.tree() function.
liveness_high_tree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 953 273.400 No ( 0.967471 0.032529 )
## 2) energy < 60.5 364 34.770 No ( 0.991758 0.008242 )
## 4) in_spotify_charts < 34.5 332 13.610 No ( 0.996988 0.003012 )
## 8) bpm < 83.5 33 8.962 No ( 0.969697 0.030303 )
## 16) bpm < 81.5 28 0.000 No ( 1.000000 0.000000 ) *
## 17) bpm > 81.5 5 5.004 No ( 0.800000 0.200000 ) *
## 9) bpm > 83.5 299 0.000 No ( 1.000000 0.000000 ) *
## 5) in_spotify_charts > 34.5 32 14.960 No ( 0.937500 0.062500 )
## 10) valence < 37 9 9.535 No ( 0.777778 0.222222 ) *
## 11) valence > 37 23 0.000 No ( 1.000000 0.000000 ) *
## 3) energy > 60.5 589 225.200 No ( 0.952462 0.047538 )
## 6) in_apple_playlists < 75.5 400 192.400 No ( 0.935000 0.065000 )
## 12) bpm < 159 360 186.700 No ( 0.927778 0.072222 )
## 24) bpm < 153.5 346 158.400 No ( 0.939306 0.060694 )
## 48) streams < 1.12543e+09 332 139.900 No ( 0.945783 0.054217 )
## 96) in_apple_charts < 0.5 35 0.000 No ( 1.000000 0.000000 ) *
## 97) in_apple_charts > 0.5 297 135.800 No ( 0.939394 0.060606 )
## 194) danceability < 91.5 291 123.900 No ( 0.945017 0.054983 )
## 388) in_spotify_charts < 2.5 137 81.360 No ( 0.912409 0.087591 )
## 776) acousticness < 54 130 65.430 No ( 0.930769 0.069231 )
## 1552) in_apple_charts < 21.5 82 56.740 No ( 0.890244 0.109756 )
## 3104) bpm < 126.5 63 24.120 No ( 0.952381 0.047619 )
## 6208) bpm < 93 21 17.220 No ( 0.857143 0.142857 )
## 12416) artist_count < 1.5 12 0.000 No ( 1.000000 0.000000 ) *
## 12417) artist_count > 1.5 9 11.460 No ( 0.666667 0.333333 ) *
## 6209) bpm > 93 42 0.000 No ( 1.000000 0.000000 ) *
## 3105) bpm > 126.5 19 23.700 No ( 0.684211 0.315789 )
## 6210) released_year < 2022.5 13 17.940 No ( 0.538462 0.461538 )
## 12420) in_apple_charts < 5.5 6 5.407 No ( 0.833333 0.166667 ) *
## 12421) in_apple_charts > 5.5 7 8.376 Yes ( 0.285714 0.714286 ) *
## 6211) released_year > 2022.5 6 0.000 No ( 1.000000 0.000000 ) *
## 1553) in_apple_charts > 21.5 48 0.000 No ( 1.000000 0.000000 ) *
## 777) acousticness > 54 7 9.561 No ( 0.571429 0.428571 ) *
## 389) in_spotify_charts > 2.5 154 37.100 No ( 0.974026 0.025974 )
## 778) speechiness < 4.5 54 28.520 No ( 0.925926 0.074074 )
## 1556) in_apple_playlists < 26.5 33 0.000 No ( 1.000000 0.000000 ) *
## 1557) in_apple_playlists > 26.5 21 20.450 No ( 0.809524 0.190476 )
## 3114) released_month < 4.5 8 0.000 No ( 1.000000 0.000000 ) *
## 3115) released_month > 4.5 13 16.050 No ( 0.692308 0.307692 )
## 6230) in_apple_playlists < 50 8 11.090 No ( 0.500000 0.500000 ) *
## 6231) in_apple_playlists > 50 5 0.000 No ( 1.000000 0.000000 ) *
## 779) speechiness > 4.5 100 0.000 No ( 1.000000 0.000000 ) *
## 195) danceability > 91.5 6 7.638 No ( 0.666667 0.333333 ) *
## 49) streams > 1.12543e+09 13 14.050 No ( 0.769231 0.230769 )
## 98) valence < 45.5 6 8.318 No ( 0.500000 0.500000 ) *
## 99) valence > 45.5 7 0.000 No ( 1.000000 0.000000 ) *
## 25) bpm > 153.5 14 18.250 No ( 0.642857 0.357143 )
## 50) energy < 81 7 0.000 No ( 1.000000 0.000000 ) *
## 51) energy > 81 7 8.376 Yes ( 0.285714 0.714286 ) *
## 13) bpm > 159 40 0.000 No ( 1.000000 0.000000 ) *
## 7) in_apple_playlists > 75.5 189 22.170 No ( 0.989418 0.010582 )
## 14) released_month < 11.5 178 0.000 No ( 1.000000 0.000000 ) *
## 15) released_month > 11.5 11 10.430 No ( 0.818182 0.181818 )
## 30) in_spotify_playlists < 8983.5 6 0.000 No ( 1.000000 0.000000 ) *
## 31) in_spotify_playlists > 8983.5 5 6.730 No ( 0.600000 0.400000 ) *
Energy, in_apple_playlists and bpm turn out to be important.
Using the validation-set method we estimate the test error of the classification tree for this problem.
set.seed(1)
n <- nrow(dataframeLive)
train <- sample(n, n / 2)
test <- -train
liveness_high_tree <- tree(live ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts - liveness, data = dataframeLive, subset = train)
## Warning in tree(live ~ . - track_name - artist.s._name - released_day - : NAs
## introduced by coercion
tree_class <- predict(liveness_high_tree, newdata = dataframeLive[test,], type = "class")
## Warning in pred1.tree(object, tree.matrix(newdata)): NAs introduced by coercion
table(tree_class, dataframeLive$live[test])
##
## tree_class No Yes
## No 457 17
## Yes 2 1
mean(tree_class != dataframeLive$live[test])
## [1] 0.03983229
The large tree \(T_0\) for the
training set dataframeLive[train,].
{
plot(liveness_high_tree)
text(liveness_high_tree, pretty = 0)
}
To find the optimal subtree we apply cost-complexity pruning. Using CV we construct a sequence of subtrees determined by decreasing complexity.
set.seed(1)
liveness_high_cv <- cv.tree(liveness_high_tree, FUN = prune.misclass)
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
liveness_high_cv
## $size
## [1] 14 7 1
##
## $dev
## [1] 24 24 16
##
## $k
## [1] -Inf 0.0 0.5
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(liveness_high_cv$size, liveness_high_cv$dev, type = "b")
The component liveness_high_cv$dev contains the number of CV errors.
We prune the tree \(T_0\) to the subtree
with the lowest CV error rate.
size_opt <- liveness_high_cv$size[which.min(liveness_high_cv$dev)]
# unfortunately size_opt = 1, which cannot be plotted
liveness_high_pruned <- prune.misclass(liveness_high_tree, best = 5)
{
plot(liveness_high_pruned)
text(liveness_high_pruned, pretty = 0)
}
The optimal tree had only one level and for some reason could not be drawn, so I plot this one instead. It agrees with the intuition about the important features.
Test error rate for the optimal subtree.
pruned_class <- predict(liveness_high_pruned, newdata = dataframeLive[test,],
type = "class")
## Warning in pred1.tree(object, tree.matrix(newdata)): NAs introduced by coercion
table(pruned_class, dataframeLive$live[test])
##
## pruned_class No Yes
## No 457 17
## Yes 2 1
mean(pruned_class != dataframeLive$live[test])
## [1] 0.03983229
liveness_tree <- tree(liveness ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe)
## Warning in tree(liveness ~ . - track_name - artist.s._name - released_day - :
## NAs introduced by coercion
summary(liveness_tree)
##
## Regression tree:
## tree(formula = liveness ~ . - track_name - artist.s._name - released_day -
## in_deezer_playlists - in_shazam_charts, data = dataframe)
## Variables actually used in tree construction:
## [1] "energy" "bpm"
## Number of terminal nodes: 3
## Residual mean deviance: 177.7 = 168900 / 950
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -47.750 -8.759 -5.759 0.000 5.241 79.240
Deviance here means RSS. Representation of the tree:
{
liveness_tree
plot(liveness_tree)
text(liveness_tree)
}
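The claim that deviance equals RSS for a regression tree can be verified directly; a sketch, assuming liveness_tree and dataframe from above:

```r
# For a regression tree the reported deviance is the residual sum of
# squares of the fitted (per-leaf mean) predictions.
all.equal(deviance(liveness_tree),
          sum((predict(liveness_tree) - dataframe$liveness)^2))
```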
Again energy and BPM are important, which makes sense in the context of live performances.
Using the validation-set method we estimate the test error.
set.seed(1)
n <- nrow(dataframe)
train <- sample(n, n / 2)
test <- -train
liveness_tree <- tree(liveness ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe, subset = train)
## Warning in tree(liveness ~ . - track_name - artist.s._name - released_day - :
## NAs introduced by coercion
liveness_pred <- predict(liveness_tree, newdata = dataframe[test,])
## Warning in pred1.tree(object, tree.matrix(newdata)): NAs introduced by coercion
mean((liveness_pred - dataframe$liveness[test])^2)
## [1] 230.8634
{
plot(liveness_tree)
text(liveness_tree)
}
We find the optimal subtree by cost-complexity pruning.
liveness_cv <- cv.tree(liveness_tree)
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
plot(liveness_cv$size, liveness_cv$dev, type = "b")
The tree of size 3 works best.
Pruning the tree \(T_0\) to the
desired size is done in this case by the
prune.tree() function.
liveness_pruned <- prune.tree(liveness_tree, best = 4)
plot(liveness_pruned)
text(liveness_pruned)
[Compute the test-error estimate for the subtree with 4 leaves.]
Bagging and random forests are implemented in the
randomForest package. Bagging is, of course, a special
case of a random forest.
Bagging for the regression of liveness on all the
remaining variables in the dataframe data set.
liveness_bag <- randomForest(liveness ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe, mtry = 13, importance = TRUE)
liveness_bag
##
## Call:
## randomForest(formula = liveness ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe, mtry = 13, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 13
##
## Mean of squared residuals: 197.1739
## % Var explained: -4.99
OOB error plotted against the number of trees.
plot(liveness_bag, type = "l")
Adding more than 100-200 trees no longer improves the result.
For regression, the OOB MSE is available in the
mse component of a randomForest object. For
classification, the OOB results are available in the
err.rate (error rate) and confusion
(confusion matrix) components.
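As a quick sketch, assuming the `liveness_bag` fit from above, these components let us read off the final OOB MSE and the number of trees at which it bottoms out:

```r
# OOB MSE after the full 500 trees
tail(liveness_bag$mse, 1)
# Number of trees minimizing the OOB MSE
which.min(liveness_bag$mse)
```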
Computing variable importance
importance(liveness_bag)
## %IncMSE IncNodePurity
## artist_count 5.67845283 4269.1055
## released_year 4.50259927 5008.6306
## released_month 1.09900812 7015.6625
## in_spotify_playlists 7.88154381 16357.8599
## in_spotify_charts 6.00499571 6282.4111
## streams -0.91095539 12706.5470
## in_apple_playlists 6.74829503 10859.3255
## in_apple_charts 2.38574303 9361.9722
## in_deezer_charts 3.67634998 4039.2470
## bpm 5.01592523 14941.9088
## key -0.00522467 6909.8455
## mode 0.94644763 1613.6739
## danceability 6.62912780 13196.8077
## valence 3.68052570 11567.1685
## energy 19.15120059 17393.8035
## acousticness 15.28882140 16927.8176
## instrumentalness -1.35875602 670.8521
## speechiness 0.40185385 7738.9384
And the corresponding plot
varImpPlot(liveness_bag)
We see that energy and acousticness are clearly the most important, followed by the number of playlists a song appears on. Key and instrumentalness hardly matter.
Test-error estimate on the previously constructed validation set.
set.seed(2)
liveness_bag <- randomForest(liveness ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe, subset = train, mtry = 13,
importance = TRUE)
liveness_pred_bag <- predict(liveness_bag, newdata = dataframe[test,])
mean((liveness_pred_bag - dataframe$liveness[test])^2)
## [1] 200.766
varImpPlot(liveness_bag)
The importance ranking changed slightly.
The same with fewer trees grown
set.seed(2)
liveness_bag_s <- randomForest(liveness ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe, subset = train, mtry = 13,
importance = TRUE, ntree = 25)
liveness_pred_bag_s <- predict(liveness_bag_s, newdata = dataframe[test,])
mean((liveness_pred_bag_s - dataframe$liveness[test])^2)
## [1] 207.2359
The default value of the mtry parameter is \(p/3\) for regression and \(\sqrt{p}\) for classification.
Test-error estimate on the previously constructed validation set.
set.seed(2)
liveness_rf <- randomForest(liveness ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe, subset = train,
importance = TRUE)
liveness_pred_rf <- predict(liveness_rf, newdata = dataframe[test,])
mean((liveness_pred_rf - dataframe$liveness[test])^2)
## [1] 198.8062
A slightly better result than with bagging
The same with the parameter \(m\) (i.e. mtry) set by hand. With 18 predictors the regression default \(p/3\) already equals 6, so the result is unchanged.
set.seed(2)
liveness_rf <- randomForest(liveness ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe, subset = train, mtry = 6,
importance = TRUE)
liveness_pred_rf <- predict(liveness_rf, newdata = dataframe[test,])
mean((liveness_pred_rf - dataframe$liveness[test])^2)
## [1] 198.8062
We use boosting algorithms for decision trees as implemented
in the gbm package. Another implementation — efficient and
common in applications — is provided by the
xgboost package.
Boosting for the regression of liveness on the remaining
variables in the dataframe dataset. The model is fitted
with gbm(), whose key parameters are:
distribution: "gaussian" for regression with
RSS, "bernoulli" for logistic-type regression;
n.trees: number of trees grown (\(B\));
interaction.depth: interaction depth (\(d\));
shrinkage: the learning-rate parameter (\(\lambda\)).
liveness_boost <- gbm(liveness ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts - streams - key - mode, data = dataframe, distribution = "gaussian",
n.trees = 5000, interaction.depth = 4)
liveness_boost
## gbm(formula = liveness ~ . - track_name - artist.s._name - released_day -
## in_deezer_playlists - in_shazam_charts - streams - key -
## mode, distribution = "gaussian", data = dataframe, n.trees = 5000,
## interaction.depth = 4)
## A gradient boosted model with gaussian loss function.
## 5000 iterations were performed.
## There were 15 predictors of which 15 had non-zero influence.
The summary.gbm() function computes variable importance and
(by default) draws the corresponding plot.
summary(liveness_boost)
## var rel.inf
## in_spotify_playlists in_spotify_playlists 12.084400
## energy energy 10.999656
## bpm bpm 9.768195
## valence valence 9.478973
## danceability danceability 9.103650
## acousticness acousticness 8.980394
## in_apple_playlists in_apple_playlists 8.670737
## in_apple_charts in_apple_charts 7.806939
## speechiness speechiness 5.316668
## in_spotify_charts in_spotify_charts 5.109875
## released_month released_month 4.364729
## released_year released_year 3.110387
## in_deezer_charts in_deezer_charts 2.509551
## artist_count artist_count 1.874445
## instrumentalness instrumentalness 0.821399
This largely agrees with the earlier findings: the number of Spotify playlist appearances, energy, and bpm are among the most important features.
The plot.gbm() function draws partial
dependence plots.
plot(liveness_boost, i.var = "acousticness")
plot(liveness_boost, i.var = "energy")
plot(liveness_boost, i.var = c("acousticness", "energy"))
Test-error estimate on the previously constructed validation set.
set.seed(2)
liveness_boost <- gbm(liveness ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts - streams - key - mode, data = dataframe[train,], distribution = "gaussian",
interaction.depth = 4, n.trees = 5000)
liveness_pred_boost <- predict(liveness_boost, newdata = dataframe[test,], n.trees = 5000)
mean((liveness_pred_boost - dataframe$liveness[test])^2)
## [1] 259.4762
The same with \(\lambda = 0.01\).
set.seed(2)
liveness_boost <- gbm(liveness ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts - streams - key - mode, data = dataframe[train,], distribution = "gaussian",
interaction.depth = 4, n.trees = 5000, shrinkage = 0.01)
liveness_pred_boost <- predict(liveness_boost, newdata = dataframe[test,], n.trees = 5000)
mean((liveness_pred_boost - dataframe$liveness[test])^2)
## [1] 223.2554
The same with \(d = 1\).
set.seed(2)
liveness_boost <- gbm(liveness ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts - streams - key - mode, data = dataframe[train,], distribution = "gaussian",
n.trees = 5000, shrinkage = 0.01)
liveness_pred_boost <- predict(liveness_boost, newdata = dataframe[test,], n.trees = 5000)
mean((liveness_pred_boost - dataframe$liveness[test])^2)
## [1] 207.1783
The results are slightly worse than for plain random forests.
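The runs above fix \(B = 5000\), but the number of boosting iterations can itself be tuned. A sketch under the same formula and data, using gbm's built-in cross-validation and gbm.perf() (the object names here are illustrative):

```r
set.seed(2)
# cv.folds = 5 makes gbm() track 5-fold CV error alongside training error
liveness_boost_cv <- gbm(liveness ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts - streams - key - mode, data = dataframe[train, ], distribution = "gaussian",
    n.trees = 5000, shrinkage = 0.01, cv.folds = 5)
# gbm.perf() returns the iteration minimizing the CV error (and plots both curves)
best_B <- gbm.perf(liveness_boost_cv, method = "cv")
liveness_pred_cv <- predict(liveness_boost_cv, newdata = dataframe[test, ], n.trees = best_B)
mean((liveness_pred_cv - dataframe$liveness[test])^2)
```

Stopping at the CV-selected \(B\) rather than the full 5000 iterations guards against the overfitting visible in the first boosting run.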